
Showing papers on "Outlier published in 2007"


Proceedings ArticleDOI
04 Jun 2007
TL;DR: The paper provides theoretical evidence that insertion of a new data point, as well as deletion of an old data point, influences only a limited number of its closest neighbors, so the number of updates per insertion/deletion does not depend on the total number of points in the data set.
Abstract: Outlier detection has recently become an important problem in many industrial and financial applications. This problem is further complicated by the fact that in many cases, outliers have to be detected from data streams that arrive at an enormous pace. In this paper, an incremental LOF (local outlier factor) algorithm, appropriate for detecting outliers in data streams, is proposed. The proposed incremental LOF algorithm provides detection performance equivalent to the iterated static LOF algorithm (applied after insertion of each data record), while requiring significantly less computational time. In addition, the incremental LOF algorithm dynamically updates the profiles of data points. This is a very important property, since data profiles may change over time. The paper provides theoretical evidence that insertion of a new data point, as well as deletion of an old data point, influences only a limited number of its closest neighbors, and thus the number of updates per insertion/deletion does not depend on the total number of points N in the data set. Our experiments, performed on several simulated and real-life data sets, have demonstrated that the proposed incremental LOF algorithm is computationally efficient, while at the same time very successful in detecting outliers and changes of distributional behavior in various data stream applications.
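
As a point of reference, the "iterated static LOF" baseline that the incremental algorithm is compared against can be sketched with scikit-learn, which ships a static LOF implementation; the incremental variant itself is not part of standard libraries, and all parameter choices below are illustrative rather than taken from the paper.

```python
# Naive "iterated static LOF" baseline: recompute LOF after every insertion.
# The paper's incremental algorithm avoids exactly this full recomputation.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
stream = rng.normal(size=(500, 2))          # simulated data stream
stream[250] = [8.0, 8.0]                    # one injected outlier

window = []
for t, x in enumerate(stream):
    window.append(x)
    if len(window) < 30:                    # need enough points for k-NN
        continue
    lof = LocalOutlierFactor(n_neighbors=20)
    labels = lof.fit_predict(np.asarray(window))
    if labels[-1] == -1:                    # newest point flagged as outlier
        print(f"t={t}: point {x} flagged, LOF score "
              f"{-lof.negative_outlier_factor_[-1]:.2f}")
```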

397 citations


Journal ArticleDOI
TL;DR: A general-purpose method called conditional anomaly detection is proposed for taking differences among attributes into account, together with three different expectation-maximization algorithms for learning the model used in conditional anomaly detection.
Abstract: When anomaly detection software is used as a data analysis tool, finding the hardest-to-detect anomalies is not the most critical task. Rather, it is often more important to make sure that those anomalies that are reported to the user are in fact interesting. If too many unremarkable data points are returned to the user labeled as candidate anomalies, the software can soon fall into disuse. One way to ensure that returned anomalies are useful is to make use of domain knowledge provided by the user. Often, the data in question includes a set of environmental attributes whose values a user would never consider to be directly indicative of an anomaly. However, such attributes cannot be ignored because they have a direct effect on the expected distribution of the result attributes whose values can indicate an anomalous observation. This paper describes a general-purpose method called conditional anomaly detection for taking such differences among attributes into account, and proposes three different expectation-maximization algorithms for learning the model that is used in conditional anomaly detection. Experiments with more than 13 different data sets compare our algorithms with several other more standard methods for outlier or anomaly detection.
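
The conditioning idea, scoring how surprising a result attribute is given the environmental attributes rather than marginally, can be illustrated with a deliberately simplified stand-in; the paper's actual model is a Gaussian mixture learned by EM, which this linear sketch does not reproduce.

```python
# Simplified stand-in for conditional anomaly detection: score points by how
# surprising the result attribute is *given* the environmental attributes.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
env = rng.uniform(0, 30, size=(1000, 1))            # e.g., outside temperature
env[10] = 25.0
result = 2.0 * env[:, 0] + rng.normal(0, 1.0, 1000) # e.g., energy usage
result[10] = 5.0   # an ordinary value marginally, anomalous given env[10] = 25

model = LinearRegression().fit(env, result)
residual = result - model.predict(env)
z = (residual - residual.mean()) / residual.std()
print(np.where(np.abs(z) > 4)[0])   # index 10 appears despite its ordinary value
```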

331 citations


Journal ArticleDOI
TL;DR: In this paper, some basic concepts of robust techniques are presented and their usefulness in chemometric data analysis is stressed.

321 citations


Journal ArticleDOI
TL;DR: Results from controlled testing show that the proposed model yields significant improvement, both in reducing the magnitude of observational residuals and in the three-dimensional positioning accuracy of signalised points.
Abstract: A rigorous method for terrestrial laser scanner self-calibration using a network of signalised points is presented. Exterior orientation, object point co-ordinates and additional parameters are estimated simultaneously by free network adjustment. Spherical co-ordinate observation equations are augmented with a set of additional parameters that model systematic errors in range, horizontal direction and elevation angle. The error models include both physically interpretable and empirically identified components. Though the focus is on one particular make and model of AM–CW scanner system, the Faro 880, the mathematical models are formulated in a general framework so their application to other instruments only requires selection of an appropriate set of additional parameters. Results from controlled testing show that significant improvement is achieved by using the proposed model, both in reducing the magnitude of observational residuals and in improving the three-dimensional positioning accuracy of signalised points. Ten self-calibration datasets captured over the course of 13 months are used to examine short- and long-term additional parameter stability via standard hypothesis testing techniques. Detailed investigations into correlation mechanisms between model parameters accompany the self-calibration solution analyses. Other contributions include an observation model for incorporation of integrated inclinometer observations into the self-calibration solution and an effective a priori outlier removal method. The benefit of the former is demonstrated to be reduced correlation between exterior orientation and additional parameters, even if inclinometer precision is low. The latter is arrived at by detailed analysis of the influence of incidence angle on range.

233 citations


Proceedings ArticleDOI
06 Nov 2007
TL;DR: A method for detecting distance-based outliers in data streams under the sliding window model is presented, where outlier queries are performed in order to detect anomalies in the current window.
Abstract: In this work a method for detecting distance-based outliers in data streams is presented. We deal with the sliding window model, where outlier queries are performed in order to detect anomalies in the current window. Two algorithms are presented. The first one exactly answers outlier queries, but has larger space requirements. The second algorithm is directly derived from the exact one, has limited memory requirements and returns an approximate answer based on accurate estimations with a statistical guarantee. Several experiments have been conducted, confirming the effectiveness of the proposed approach and the high quality of the approximate solutions.
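
The query semantics can be made concrete with a naive O(W²) sketch: a point in the current window is a distance-based outlier if fewer than k other window points lie within radius R. The paper's algorithms answer the same queries with far better bookkeeping; W, R, and k below are illustrative.

```python
# Naive exact distance-based outlier query over a sliding window.
from collections import deque
import numpy as np

W, R, k = 200, 0.5, 3          # window size, radius, neighbor threshold

def window_outliers(window: deque) -> list[int]:
    pts = np.asarray(window)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    counts = (d <= R).sum(axis=1) - 1      # neighbors within R, excluding self
    return [i for i, c in enumerate(counts) if c < k]

window = deque(maxlen=W)
for x in np.random.default_rng(2).normal(size=(1000, 2)):
    window.append(x)
    outliers = window_outliers(window)     # indices of current outliers
```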

228 citations


Book ChapterDOI
18 Jul 2007
TL;DR: A novel unsupervised algorithm for outlier detection with a solid statistical foundation is proposed, modifying a nonparametric density estimate with a variable kernel to yield a robust local density estimation.
Abstract: Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel unsupervised algorithm for outlier detection with a solid statistical foundation is proposed. First we modify a nonparametric density estimate with a variable kernel to yield a robust local density estimation. Outliers are then detected by comparing the local density of each point to the local density of its neighbors. Our experiments performed on several simulated data sets have demonstrated that the proposed approach can outperform two widely used outlier detection algorithms (LOF and LOCI).
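
A hedged sketch of this recipe, with a per-point bandwidth taken from the k-th neighbor distance and an outlier score comparing each point's density with that of its neighbors, might look as follows; the paper's exact bandwidth rule and score normalization are not reproduced here.

```python
# Variable-bandwidth Gaussian KDE, then a neighbor-vs-own density ratio.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_scores(X: np.ndarray, k: int = 10) -> np.ndarray:
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)            # column 0 is the point itself
    h = dist[:, -1] + 1e-12                 # variable bandwidth: k-th NN distance
    d = X.shape[1]

    # leave-one-out KDE with per-kernel bandwidth h_j
    diff = X[:, None, :] - X[None, :, :]
    sq = (diff ** 2).sum(-1) / h[None, :] ** 2
    K = np.exp(-0.5 * sq) / ((2 * np.pi) ** (d / 2) * h[None, :] ** d)
    np.fill_diagonal(K, 0.0)
    dens = K.mean(axis=1)

    # outlier score: mean neighbor density over own density (large => outlier)
    return dens[idx[:, 1:]].mean(axis=1) / dens

X = np.vstack([np.random.default_rng(3).normal(size=(200, 2)), [[6, 6]]])
print(density_scores(X).argmax())           # index 200: the planted outlier
```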

225 citations


Journal ArticleDOI
TL;DR: In this paper, the authors developed a procedure to overcome the problem of non-identifiability of distributed parameters by introducing aggregate parameters and using Bayesian inference, and they demonstrated the good performance of this approach to uncertainty analysis, particularly with respect to the fulfilment of statistical assumptions of the error model.

221 citations


Journal ArticleDOI
TL;DR: Two variations of a method that uses the median from a neighborhood of a data point and a threshold value to compare the difference between the median and the observed data value are proposed.
Abstract: In this article we consider the problem of detecting unusual values or outliers from time series data where the process by which the data are created is difficult to model. The main consideration is the fact that data closer in time are more correlated to each other than those farther apart. We propose two variations of a method that uses the median from a neighborhood of a data point and a threshold value to compare the difference between the median and the observed data value. Both variations of the method are fast and can be used for data streams that occur in quick succession such as sensor data on an airplane.
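
One plausible reading of the basic scheme, flagging an observation when it deviates from the median of a two-sided neighborhood by more than a threshold, is sketched below; the window size, threshold, and robust scale estimate are illustrative choices, and the paper's two variations differ in details not reproduced here.

```python
# Moving-median outlier detection for time series.
import numpy as np

def median_outliers(x: np.ndarray, half_window: int = 5, tau: float = 4.0):
    flags = np.zeros(len(x), dtype=bool)
    # robust noise scale from the series' first differences
    dx = np.diff(x)
    sigma = 1.4826 * np.median(np.abs(dx - np.median(dx)))
    for i in range(len(x)):
        lo, hi = max(0, i - half_window), min(len(x), i + half_window + 1)
        med = np.median(np.concatenate([x[lo:i], x[i + 1:hi]]))
        flags[i] = abs(x[i] - med) > tau * sigma
    return flags

t = np.linspace(0, 10, 500)
x = np.sin(t) + np.random.default_rng(4).normal(0, 0.05, 500)
x[100] += 2.0                                  # spike
print(np.where(median_outliers(x))[0])         # should include 100
```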

219 citations


Journal ArticleDOI
TL;DR: In this paper, a robust projection-pursuit-based method for principal component analysis (PCA) is proposed for the analysis of chemical data, where the number of variables is typically large.

207 citations


Proceedings ArticleDOI
09 Sep 2007
TL;DR: A histogram-based method for outlier detection is proposed that reduces communication cost by collecting hints (in the form of a histogram) about the data distribution and using them to filter out unnecessary data and identify potential outliers.
Abstract: Outlier detection has many important applications in sensor networks, e.g., detecting abnormal events or changes in animal behavior. It is a difficult problem since global information about data distributions must be known to identify outliers. In this paper, we use a histogram-based method for outlier detection to reduce communication cost. Rather than collecting all the data in one location for centralized processing, we propose collecting hints (in the form of a histogram) about the data distribution, and using the hints to filter out unnecessary data and identify potential outliers. We show that this method can be used for detecting outliers under two different definitions. Our simulation results show that the histogram method can dramatically reduce the communication cost.
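
A toy version of the two-phase idea, in which nodes ship fixed-bin histograms instead of raw readings and the sink requests raw values only for sparsely populated bins, is sketched below; the bin count and sparsity threshold are illustrative.

```python
# Histogram "hints": sensors send counts, the sink asks only for sparse bins.
import numpy as np

rng = np.random.default_rng(5)
sensors = [rng.normal(20, 2, 300) for _ in range(10)]   # 10 sensor nodes
sensors[3][0] = 45.0                                    # anomalous reading

edges = np.linspace(0, 50, 26)                          # shared 25-bin grid
merged = sum(np.histogram(s, bins=edges)[0] for s in sensors)

# a value can only have few neighbors if its bin is globally sparse, so the
# sink requests raw data only for bins with few readings overall
sparse = np.where(merged < 5)[0]
for sid, s in enumerate(sensors):
    bins = np.digitize(s, edges) - 1
    for v in s[np.isin(bins, sparse)]:
        print(f"sensor {sid}: candidate outlier {v:.1f}")
```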

195 citations


Proceedings ArticleDOI
12 Aug 2007
TL;DR: An alternative definition of anomalies is presented, together with an approach that compares records against marginal distributions of attribute subsets, which performs better on semi-synthetic as well as real-world datasets.
Abstract: We consider the problem of detecting anomalies in high-arity categorical datasets. In most applications, anomalies are defined as data points that are "abnormal". Quite often we have access to data which consists mostly of normal records, along with a small percentage of unlabelled anomalous records. We are interested in the problem of unsupervised anomaly detection, where we use the unlabelled data for training, and detect records that do not follow the definition of normality. A standard approach is to create a model of normal data, and compare test records against it. A probabilistic approach builds a likelihood model from the training data. Records are tested for anomalies based on the complete record likelihood given the probability model. For categorical attributes, Bayes nets give a standard representation of the likelihood. While this approach is good at finding outliers in the dataset, it often tends to detect records with attribute values that are rare. Sometimes, just detecting rare values of an attribute is not desired, and such outliers are not considered anomalies in that context. We present an alternative definition of anomalies, and propose an approach of comparing against marginal distributions of attribute subsets. We show that this is a more meaningful way of detecting anomalies, and that it has better performance over semi-synthetic as well as real-world datasets.
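
The contrast with full-record likelihood can be illustrated by scoring a record with the empirical marginal frequency of its rarest attribute pair; the paper's definition and its choice of attribute subsets are richer than this sketch.

```python
# Score categorical records by the rarest pairwise attribute combination.
from collections import Counter
from itertools import combinations

train = [("linux", "http", "low"), ("linux", "http", "low"),
         ("windows", "smtp", "low"), ("linux", "ssh", "high")] * 50

pair_counts = {
    (i, j): Counter((r[i], r[j]) for r in train)
    for i, j in combinations(range(3), 2)
}
n = len(train)

def score(record):
    # low score = some pair of values essentially never co-occurs
    return min(pair_counts[ij][(record[ij[0]], record[ij[1]])] / n
               for ij in pair_counts)

print(score(("linux", "http", "low")))    # common combination
print(score(("windows", "ssh", "low")))   # each value common, the pair unseen
```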

Journal ArticleDOI
TL;DR: Experimental results show that the proposed TAF-SVM is superior to SVM in terms of the face-recognition accuracy and can achieve smaller error variances than SVM over a number of tests such that better recognition stability can be obtained.
Abstract: This paper presents a new classifier called total margin-based adaptive fuzzy support vector machines (TAF-SVM) that deals with several problems that may occur in support vector machines (SVMs) when applied to face recognition. The proposed TAF-SVM not only solves the overfitting problem resulting from outliers, by fuzzifying the penalty, but also corrects the skew of the optimal separating hyperplane caused by very imbalanced data sets by using a different-cost algorithm. In addition, by introducing the total margin algorithm to replace the conventional soft margin algorithm, a lower generalization error bound can be obtained. These three functions are embodied in the traditional SVM, and the TAF-SVM is formulated for both linear and nonlinear cases. Using two databases, the Chung Yuan Christian University (CYCU) multiview and the facial recognition technology (FERET) face databases, and using the kernel Fisher's discriminant analysis (KFDA) algorithm to extract discriminating face features, experimental results show that the proposed TAF-SVM is superior to SVM in terms of face-recognition accuracy. The results also indicate that the proposed TAF-SVM can achieve smaller error variances than SVM over a number of tests, so that better recognition stability can be obtained.

Journal ArticleDOI
TL;DR: Experimental results indicate that the proposed method reduces the effect of outliers and yields a higher classification rate than standard SVM when outliers exist in the training data set.
Abstract: This paper presents a weighted support vector machine (WSVM) to improve the outlier sensitivity problem of the standard support vector machine (SVM) for two-class data classification. The basic idea is to assign different weights to different data points such that the WSVM training algorithm learns the decision surface according to the relative importance of data points in the training data set. The weights used in WSVM are generated by a robust fuzzy clustering algorithm, the kernel-based possibilistic c-means (KPCM) algorithm, whose partition generates relatively high values for important data points but low values for outliers. Experimental results indicate that the proposed method reduces the effect of outliers and yields a higher classification rate than standard SVM when outliers exist in the training data set.
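
The weighted-training mechanism is easy to try with scikit-learn, whose SVC accepts per-sample weights at fit time; the crude robust-distance weighting below merely stands in for the paper's KPCM weights.

```python
# Weighted SVM training: downweight suspicious points before fitting.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.r_[np.zeros(100), np.ones(100)]
X[0] = [6, 6]                               # outlying class-0 point

# proxy weights: downweight points far from their own class median
w = np.empty(len(X))
for c in (0, 1):
    m = np.median(X[y == c], axis=0)
    d = np.linalg.norm(X[y == c] - m, axis=1)
    w[y == c] = 1.0 / (1.0 + d / np.median(d))

clf = SVC(kernel="rbf", C=10.0)
clf.fit(X, y, sample_weight=w)              # weighted training, as in WSVM
```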

Journal ArticleDOI
TL;DR: This article considers the problem of building a linear prediction model when the number of candidate predictors is large and the data possibly contain anomalies that are difficult to visualize and clean and introduces two different approaches to robustify LARS.
Abstract: In this article we consider the problem of building a linear prediction model when the number of candidate predictors is large and the data possibly contain anomalies that are difficult to visualize and clean. We want to predict the nonoutlying cases; therefore, we need a method that is simultaneously robust and scalable. We consider the stepwise least angle regression (LARS) algorithm which is computationally very efficient but sensitive to outliers. We introduce two different approaches to robustify LARS. The plug-in approach replaces the classical correlations in LARS by robust correlation estimates. The cleaning approach first transforms the data set by shrinking the outliers toward the bulk of the data (which we call multivariate Winsorization) and then applies LARS to the transformed data. We show that the plug-in approach is time-efficient and scalable and that the bootstrap can be used to stabilize its results. We recommend using bootstrapped robustified LARS to sequence a number of candidate pred...
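
The cleaning approach can be sketched as follows, with a simple coordinatewise median/MAD winsorization standing in for the paper's multivariate Winsorization, followed by scikit-learn's LARS on the cleaned data.

```python
# "Cleaning" approach: shrink outliers toward the bulk, then run LARS.
import numpy as np
from sklearn.linear_model import Lars

def winsorize(X: np.ndarray, c: float = 2.5) -> np.ndarray:
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0) + 1e-12
    z = (X - med) / mad
    shrink = np.minimum(1.0, c / np.maximum(np.abs(z), 1e-12))
    return med + z * shrink * mad           # points with |z| > c move inward

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 50))
X[0, :5] = 40.0                             # leverage point
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)

# clean both predictors and response before sequencing predictors
model = Lars(n_nonzero_coefs=5).fit(winsorize(X), winsorize(y[:, None]).ravel())
print(np.nonzero(model.coef_)[0])           # ideally selects columns 0 and 1
```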

Journal ArticleDOI
TL;DR: A method for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples, which can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples.
Abstract: We propose a method for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples. This can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples. In real and simulated examples, the new method often exhibits lower false discovery rates than simple t-statistic thresholding. We also compare our approach to the recent cancer profile outlier analysis proposal of Tomlins and others (2005).
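
An outlier-sum-type statistic of the kind compared in this literature can be sketched as below; the standardization and cutoff constants are reconstructed from memory of related work, not copied from the paper.

```python
# Sketch of an outlier-sum-type statistic for one gene: median/MAD-standardize
# across all samples, then sum the disease-group values beyond q75 + IQR.
import numpy as np

def outlier_sum(control: np.ndarray, disease: np.ndarray) -> float:
    x = np.concatenate([control, disease])
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))
    z = (x - med) / mad
    q25, q75 = np.percentile(z, [25, 75])
    cutoff = q75 + (q75 - q25)
    zd = z[len(control):]                   # standardized disease samples
    return float(zd[zd > cutoff].sum())

rng = np.random.default_rng(8)
ctrl, dis = rng.normal(0, 1, 30), rng.normal(0, 1, 30)
dis[:5] += 4.0                              # gene activated in a 5-sample subset
print(outlier_sum(ctrl, dis))               # large positive value
```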

Journal ArticleDOI
TL;DR: An EM-based algorithm is developed for the fitting of mixtures of t-factor analyzers and its application is demonstrated in the clustering of some microarray gene-expression data.

Proceedings ArticleDOI
21 Aug 2007
TL;DR: A new distance measure, the fractional root mean squared distance (FRMSD), is formalized, which incorporates the fraction of inliers into the distance function; the resulting algorithm is guaranteed to converge to a locally optimal solution.
Abstract: We describe a variation of the iterative closest point (ICP) algorithm for aligning two point sets under a set of transformations. Our algorithm is superior to previous algorithms because (1) in determining the optimal alignment, it identifies and discards likely outliers in a statistically robust manner, and (2) it is guaranteed to converge to a locally optimal solution. To this end, we formalize a new distance measure, fractional root mean squared distance (FRMSD), which incorporates the fraction of inliers into the distance function. Our framework can easily incorporate most techniques and heuristics from modern registration algorithms. We experimentally validate our algorithm against previous techniques on 2 and 3 dimensional data exposed to a variety of outlier types.
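
The flavor of FRMSD can be conveyed by a small sketch: for a candidate inlier fraction f, take the f·n smallest residuals and penalize small fractions by 1/f^λ. The value of the trade-off parameter λ used below is an assumption, and the full FICP loop (alternating correspondence, transform, and fraction updates) is omitted.

```python
# FRMSD-style scoring of a candidate inlier fraction.
import numpy as np

def frmsd(residuals: np.ndarray, f: float, lam: float = 3.0) -> float:
    r = np.sort(residuals)
    m = max(1, int(round(f * len(r))))
    return (1.0 / f**lam) * np.sqrt(np.mean(r[:m] ** 2))

def best_fraction(residuals, grid=np.linspace(0.2, 1.0, 81)):
    # inner step of the algorithm: pick the fraction minimizing FRMSD
    return min(grid, key=lambda f: frmsd(residuals, f))

rng = np.random.default_rng(9)
res = np.abs(np.r_[rng.normal(0, 0.01, 90), rng.uniform(1, 2, 10)])
print(f"estimated inlier fraction ≈ {best_fraction(res):.2f}")   # near 0.9
```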

Proceedings ArticleDOI
10 Dec 2007
TL;DR: A modified Kalman filter is introduced that can perform robust, real-time outlier detection in the observations, without the need for manual parameter tuning by the user, using a weighted least squares-like approach.
Abstract: In this paper, we introduce a modified Kalman filter that can perform robust, real-time outlier detection in the observations, without the need for manual parameter tuning by the user. Robotic systems that rely on high quality sensory data can be sensitive to data containing outliers. Since the standard Kalman filter is not robust to outliers, other variations of the Kalman filter have been proposed to overcome this issue, but these methods may require manual parameter tuning, use of heuristics or complicated parameter estimation. Our Kalman filter uses a weighted least squares-like approach by introducing weights for each data sample. A data sample with a smaller weight has a weaker contribution when estimating the current time step's state. We learn the weights and system dynamics using a variational Expectation-Maximization framework. We evaluate our Kalman filter algorithm on data from a robotic dog.
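
Mechanically, a per-sample weight w in (0, 1] can be realized in a Kalman update by inflating the measurement noise to R/w, so low-weight observations barely move the state. The sketch below uses a fixed 3-sigma weighting rule purely for illustration, whereas the paper learns the weights with variational EM.

```python
# Scalar Kalman filter with a per-sample weight on the measurement.
import numpy as np

def kalman_step(x, P, z, A=1.0, Q=0.01, H=1.0, R=0.1):
    x_pred, P_pred = A * x, A * P * A + Q
    innov = z - H * x_pred
    S = H * P_pred * H + R
    # downweight measurements beyond ~3 sigma (illustrative rule, not the paper's)
    w = min(1.0, 9.0 * S / innov**2) if innov != 0 else 1.0
    S_w = H * P_pred * H + R / w            # weighted measurement noise
    K = P_pred * H / S_w
    return x_pred + K * innov, (1 - K * H) * P_pred

x, P = 0.0, 1.0
for z in [0.1, 0.05, 5.0, 0.12, 0.08]:      # 5.0 is an outlier
    x, P = kalman_step(x, P, z)
    print(f"z={z:5.2f} -> state {x:.3f}")   # state barely reacts to z=5.0
```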

Journal ArticleDOI
TL;DR: Recent developments on order statistics arising from independent and non-identically distributed random variables are synthesised, and the robustness properties of several linear estimators are evaluated when multiple outliers are possibly present in the sample.
Abstract: In this paper, we consider order statistics and outlier models, and focus primarily on multiple-outlier models and associated robustness issues. We first synthesise recent developments on order statistics arising from independent and non-identically distributed random variables based primarily on the theory of permanents. We then highlight various applications of these results in evaluating the robustness properties of several linear estimators when multiple outliers are possibly present in the sample.

Journal ArticleDOI
TL;DR: This paper proposes a global semiparametric quantile regression model that can estimate conditional quantiles without the usual distributional assumptions, and develops a new model assessment tool for longitudinal growth data.
Abstract: Growth charts are often more informative when they are customized per subject, taking into account prior measurements and possibly other covariates of the subject. We study a global semiparametric quantile regression model that has the ability to estimate conditional quantiles without the usual distributional assumptions. The model can be estimated from longitudinal reference data with irregular measurement times and with some level of robustness against outliers, and it is also flexible for including covariate information. We propose a rank score test for large sample inference on covariates, and develop a new model assessment tool for longitudinal growth data. Our research indicates that the global model has the potential to be a very useful tool in conditional growth chart analysis.
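
Conditional quantile estimation without distributional assumptions is available in statsmodels; the sketch below fits cross-sectional reference centiles only and ignores the longitudinal correlation and irregular measurement times that the paper's semiparametric model handles.

```python
# Reference centiles via linear quantile regression (cross-sectional only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
age = rng.uniform(2, 18, 500)
height = 80 + 5.5 * age + rng.normal(0, 4 + 0.3 * age)   # heteroscedastic

X = sm.add_constant(age)
for q in (0.05, 0.50, 0.95):
    fit = sm.QuantReg(height, X).fit(q=q)
    print(f"{int(q*100):2d}th centile: height ≈ "
          f"{fit.params[0]:.1f} + {fit.params[1]:.2f} * age")
```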

Journal ArticleDOI
15 Apr 2007-Talanta
TL;DR: It is shown that the proposed strategy for dealing with missing values and outlying observations simultaneously in principal component analysis works well for highly contaminated data containing different amounts of missing elements.

Journal ArticleDOI
TL;DR: In this article, a continuous time autoregressive error model is proposed for statistical inference and uncertainty analysis in hydrologic modeling and applied to the Thur River basin in Switzerland, which is subject to completely different climatic conditions from the basin for which the model was originally developed.
Abstract: Calibration and uncertainty analysis in hydrologic modeling are affected by measurement errors in input and response and errors in model structure. Recently, extending similar approaches in discrete time, a continuous time autoregressive error model was proposed for statistical inference and uncertainty analysis in hydrologic modeling. The major advantages over discrete time formulation are the use of a continuous time error model for describing continuous processes, the possibility of accounting for seasonal variations of parameters in the error model, the easier treatment of missing data or omitted outliers, and the opportunity for continuous time predictions. The model was developed for the Chaohe Basin in China and had some features specific for this semiarid climatic region (in particular, the seasonal variation of parameters in the error model in response to seasonal variation in precipitation). This paper tests and extends this approach with an application to the Thur River basin in Switzerland, which is subject to completely different climatic conditions. This application corroborates the general applicability of the approach but also demonstrates the necessity of accounting for the heavy tails in the distributions of residuals and innovations. This is done by replacing the normal distribution of the innovations by a Student t distribution, the degrees of freedom of which are adapted to best represent the shape of the empirical distribution of the innovations. We conclude that with this extension, the continuous time autoregressive error model is applicable and flexible for hydrologic modeling under different climatic conditions. The major remaining conceptual disadvantage is that this class of approaches does not lead to a separate identification of model input and model structural errors. The major practical disadvantage is the high computational demand characteristic for all Markov chain Monte Carlo techniques.

Journal ArticleDOI
TL;DR: The aim of this work is to study robust regression techniques in the fixed effects linear panel data framework by means of breakdown point computations and simulation experiments, and to show the potential of robust panel data methods.
Abstract: Panel data estimators can be strongly biased in the presence of outlying observations. Although most researchers are aware of this problem, little literature exists on robust estimation of the parameters in a panel data model. In this paper, robust versions of the classical Within Group estimator are considered. The robustness of these estimators with respect to outliers will be investigated. The presence of outliers can lead to erroneous estimates in regression models. Indeed, the classical least-squares (LS) approach is known to be very sensitive to outliers. Moreover, outliers are not always detectable by looking at residuals from an LS fit, since the latter suffers from the masking effect. Masking means here that outliers affect the LS estimator in such a way that outlier diagnostics based on LS are no longer capable of detecting them. Note also that diagnostic measures like Cook's distance suffer from the masking effect as soon as multiple outliers are present. More robust alternatives to LS are the Least Absolute Deviation estimator and M-estimators. Unfortunately, these estimators are not robust with respect to leverage points, i.e. outliers in the space of the covariates. Thus, regression estimators having a high breakdown point are considered.
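
The sensitivity point is easy to reproduce with statsmodels by comparing ordinary least squares with a Huber M-estimator on contaminated data; this is plain cross-sectional regression, not the robust Within Group panel estimators studied in the paper.

```python
# OLS versus a Huber M-estimator on data with vertical outliers.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 200)
y[:10] += 25.0                          # vertical outliers

X = sm.add_constant(x)
print("OLS   slope:", sm.OLS(y, X).fit().params[1])
print("Huber slope:", sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit().params[1])
```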

Posted Content
TL;DR: The authors conducted an extensive robustness analysis of the relationship between trust and growth by investigating a later time period and a bigger sample than in previous studies, and found that when outliers (especially China) are removed, the trust-growth relationship is no longer robust.
Abstract: We conduct an extensive robustness analysis of the relationship between trust and growth by investigating a later time period and a bigger sample than in previous studies. In addition to robustness tests that focus on model uncertainty, we systematize the investigation of outlier influence on the results by using the robust estimation technique Least Trimmed Squares. We find that when outliers (especially China) are removed, the trust-growth relationship is no longer robust. On average, the trust coefficient is half as large as in previous findings.

Book ChapterDOI
09 Sep 2007
TL;DR: This method is shown to outperform a current state-of-the-art incremental one-class learning algorithm (Incremental SVDD) on a variety of datasets, while requiring only an upper limit on model complexity to be specified.
Abstract: An incremental one-class learning algorithm is proposed for the purpose of outlier detection. Outliers are identified by estimating - and thresholding - the probability distribution of the training data. In the early stages of training a non-parametric estimate of the training data distribution is obtained using kernel density estimation. Once the number of training examples reaches the maximum computationally feasible limit for kernel density estimation, we treat the kernel density estimate as a maximally-complex Gaussian mixture model, and keep the model complexity constant by merging a pair of components for each new kernel added. This method is shown to outperform a current state-of-the-art incremental one-class learning algorithm (Incremental SVDD [5]) on a variety of datasets, while requiring only an upper limit on model complexity to be specified.
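
The core operation once the kernel budget is exhausted is merging two Gaussian components into one that preserves the pair's combined mean and covariance; the moment-matching formula below is standard, while the rule for choosing which pair to merge is the paper's contribution and is not shown.

```python
# Moment-preserving merge of two weighted Gaussian mixture components.
import numpy as np

def merge(w1, m1, S1, w2, m2, S2):
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    S = (w1 * (S1 + np.outer(m1 - m, m1 - m))
         + w2 * (S2 + np.outer(m2 - m, m2 - m))) / w
    return w, m, S

w, m, S = merge(0.5, np.array([0.0, 0.0]), np.eye(2),
                0.5, np.array([1.0, 0.0]), np.eye(2))
print(m)        # [0.5, 0.]
print(S)        # identity plus extra spread along the first axis
```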

Proceedings ArticleDOI
Shipeng Yu, Volker Tresp, Kai Yu
20 Jun 2007
TL;DR: A robust framework for Bayesian multi-task learning, t-processes (TP), a generalization of Gaussian processes for multi-task learning, is introduced, allowing the system to effectively distinguish good tasks from noisy or outlier tasks.
Abstract: Most current multi-task learning frameworks ignore the robustness issue, which means that the presence of "outlier" tasks may greatly reduce overall system performance. We introduce a robust framework for Bayesian multitask learning, t-processes (TP), which are a generalization of Gaussian processes (GP) for multi-task learning. TP allows the system to effectively distinguish good tasks from noisy or outlier tasks. Experiments show that TP not only improves overall system performance, but can also serve as an indicator for the "informativeness" of different tasks.

Journal ArticleDOI
TL;DR: This work proposes the outlier robust t-statistic (ORT), intuitively motivated from the t-statistic (the most commonly used differential gene expression detection method), for detecting cancer genes that are over- or down-expressed in some but not all samples in a disease group.
Abstract: We study statistical methods to detect cancer genes that are over- or down-expressed in some but not all samples in a disease group. This has proven useful in cancer studies where oncogenes are activated only in a small subset of samples. We propose the outlier robust t-statistic (ORT), which is intuitively motivated from the t-statistic, the most commonly used differential gene expression detection method. Using real and simulation studies, we compare the ORT to the recently proposed cancer outlier profile analysis (Tomlins and others, 2005) and the outlier sum statistic of Tibshirani and Hastie (2006). The proposed method often has more detection power and smaller false discovery rates. Supplementary information can be found at http://www.biostat.umn.edu/∼baolin/research/ort.html.

Journal ArticleDOI
TL;DR: A nonparametric approach for neuroimaging data analysis based on the rank order of the data, which may offer a small benefit for datasets where the assumptions of the t-test are violated, for example where data from one of the groups exhibit a skewed distribution due to floor or ceiling effects.

Journal ArticleDOI
TL;DR: This paper proposes an efficient scheme that uses image examples to drive a powerful regularization, applied to the image scale-up (super-resolution) problem, and demonstrates the algorithm on several scanned documents with promising results.
Abstract: Regularization plays a vital role in inverse problems, and especially in ill-posed ones. Along with classical regularization techniques based on smoothness, entropy, and sparsity, an emerging powerful regularization is one that leans on image examples. In this paper, we propose an efficient scheme that uses image examples to drive a powerful regularization, applied to the image scale-up (super-resolution) problem. In this work, we specifically target scanned documents containing written text, graphics, and equations. Our algorithm starts by assigning to each location in the degraded image several candidate high-quality patches. Those are found as the nearest neighbors (NN) in an image database that contains pairs of corresponding low- and high-quality image patches. The found examples are used for the definition of an image prior expression, merged into a global MAP penalty function. We use this penalty function both for rejecting some of the irrelevant outlier examples, and then for reconstructing the desired image. We demonstrate our algorithm on several scanned documents with promising results.

Proceedings ArticleDOI
Kyung-A Yoon, Ohsung Kwon, Doo-Hwan Bae
20 Sep 2007
TL;DR: An approach to outlier detection in software measurement data using the k-means clustering method is proposed, helping to detect the outliers that reduce data quality during software measurement implementation.
Abstract: The quality of software measurement data affects the accuracy of a project manager's decision making using estimation or prediction models, and the understanding of real project status. During software measurement implementation, outliers that reduce data quality are collected; however, their detection is not easy. To cope with this problem, we propose an approach to outlier detection in software measurement data using the k-means clustering method.
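
A minimal version of this idea with scikit-learn: cluster the measurement data with k-means, then flag records unusually far from their assigned centroid. The number of clusters and the cutoff rule are illustrative choices, not the paper's.

```python
# k-means-based outlier flagging via distance to the assigned centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(12)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(8, 1, (100, 2)),
               [[4.0, 20.0]]])                 # one bad measurement record

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
cut = d.mean() + 3 * d.std()
print(np.where(d > cut)[0])                    # index 200
```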