
Showing papers on "Outlier published in 2003"


Proceedings ArticleDOI
05 Mar 2003
TL;DR: Experiments show that LOCI and aLOCI can automatically detect outliers and micro-clusters, without user-required cut-offs, and that they quickly spot both expected and unexpected outliers.
Abstract: Outlier detection is an integral part of data mining and has attracted much attention recently [M. Breunig et al., (2000)], [W. Jin et al., (2001)], [E. Knorr et al., (2000)]. We propose a new method for evaluating outlierness, which we call the local correlation integral (LOCI). As with the best previous methods, LOCI is highly effective for detecting outliers and groups of outliers (a.k.a. micro-clusters). In addition, it offers the following advantages and novelties: (a) It provides an automatic, data-dictated cutoff to determine whether a point is an outlier; in contrast, previous methods force users to pick cut-offs, without any hints as to what cut-off value is best for a given dataset. (b) It can provide a LOCI plot for each point; this plot summarizes a wealth of information about the data in the vicinity of the point, determining clusters, micro-clusters, their diameters and their inter-cluster distances. None of the existing outlier-detection methods can match this feature, because they output only a single number for each point: its outlierness score. (c) Our LOCI method can be computed as quickly as the best previous methods. (d) Moreover, LOCI leads to a practically linear approximate method, aLOCI (for approximate LOCI), which provides fast, highly accurate outlier detection. To the best of our knowledge, this is the first work to use approximate computations to speed up outlier detection. Experiments on synthetic and real-world data sets show that LOCI and aLOCI can automatically detect outliers and micro-clusters, without user-required cut-offs, and that they quickly spot both expected and unexpected outliers.
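For intuition, here is a minimal single-radius Python sketch of LOCI-style scoring (the paper's method scans a range of radii and uses exact or approximate neighborhood counts); the function name, radius and sampling factor are illustrative choices, not the authors' implementation:

```python
import numpy as np
from scipy.spatial.distance import cdist

def loci_scores(X, r=6.0, alpha=0.5, k_sigma=3.0):
    """Single-radius LOCI sketch: compare each point's alpha*r neighbourhood
    count with the average count over its r-neighbours, and flag it when the
    multi-granularity deviation factor (MDEF) exceeds k_sigma times its
    normalised deviation -- the 'data-dictated' cut-off."""
    D = cdist(X, X)
    n_alpha = (D <= alpha * r).sum(axis=1)          # counts in the alpha*r ball
    mdefs, flags = [], []
    for i in range(len(X)):
        neighbours = np.where(D[i] <= r)[0]         # sampling neighbourhood of point i
        counts = n_alpha[neighbours]
        n_hat, sigma = counts.mean(), counts.std()
        mdef = 1.0 - n_alpha[i] / n_hat
        mdefs.append(mdef)
        flags.append(mdef > k_sigma * (sigma / n_hat))
    return np.array(mdefs), np.array(flags)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[4.0, 4.0]]])   # one planted outlier
mdefs, flags = loci_scores(X)
print("highest-MDEF point:", int(np.argmax(mdefs)))         # the planted point should score highest
```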

903 citations


Journal ArticleDOI
TL;DR: A measure for identifying the physical significance of an outlier, called the cluster-based local outlier factor (CBLOF), is designed; it is meaningful and gives weight to local data behavior.

817 citations


Book ChapterDOI
10 Sep 2003
TL;DR: The locally optimized RANSAC makes no new assumptions about the data; on the contrary, it makes the above-mentioned assumption valid by applying local optimization to the solution estimated from the random sample.
Abstract: A new enhancement of RANSAC, the locally optimized RANSAC (LO-RANSAC), is introduced. It has been observed that, to find an optimal solution (with a given probability), the number of samples drawn in RANSAC is significantly higher than predicted from the mathematical model. This is due to the incorrect assumption that a model with parameters computed from an outlier-free sample is consistent with all inliers. The assumption rarely holds in practice. The locally optimized RANSAC makes no new assumptions about the data; on the contrary, it makes the above-mentioned assumption valid by applying local optimization to the solution estimated from the random sample.
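As a rough illustration of the idea (not the authors' implementation, which runs an inner RANSAC and iterative refinement in the local optimization step), the sketch below adds a simple local optimization, refitting on the consensus set, to a plain RANSAC line fit; all names and thresholds are illustrative:

```python
import numpy as np

def fit_line(pts):
    """Total least-squares line fit: returns unit normal n and offset d (points satisfy n.x = d)."""
    c = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - c)
    n = vt[-1]
    return n, float(n @ c)

def lo_ransac_line(pts, iters=200, thresh=0.05, seed=0):
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.array([], dtype=int)
    for _ in range(iters):
        n, d = fit_line(pts[rng.choice(len(pts), 2, replace=False)])
        inliers = np.where(np.abs(pts @ n - d) < thresh)[0]
        if len(inliers) > len(best_inliers):
            # local optimisation: refit on the consensus set and re-score, so the
            # sample-based hypothesis is upgraded towards an all-inlier solution
            n, d = fit_line(pts[inliers])
            inliers = np.where(np.abs(pts @ n - d) < thresh)[0]
            best_model, best_inliers = (n, d), inliers
    return best_model, best_inliers
```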

722 citations


Proceedings ArticleDOI
24 Aug 2003
TL;DR: This work shows that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used.
Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
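A minimal Python sketch of this randomised nested-loop search with the pruning rule is given below, assuming Euclidean distance and the "distance to the k-th nearest neighbour" outlier score; the names and the block-free structure are simplifications of the authors' algorithm:

```python
import numpy as np

def knn_distance_outliers(X, k=5, n_out=5, seed=0):
    """Randomised nested-loop search for the n_out points with the largest
    distance to their k-th nearest neighbour.  The pruning rule abandons a
    candidate as soon as its running k-NN distance drops below the score of
    the weakest outlier found so far, which is what yields the near-linear
    behaviour on randomly ordered data."""
    rng = np.random.default_rng(seed)
    X = X[rng.permutation(len(X))]
    top, cutoff = [], 0.0                      # top outliers kept as (score, index)
    for i, p in enumerate(X):
        dists = np.full(k, np.inf)             # k smallest distances seen so far
        for j, q in enumerate(X):
            if i == j:
                continue
            d = np.linalg.norm(p - q)
            if d < dists[-1]:
                dists[-1] = d
                dists.sort()
                if dists[-1] < cutoff:         # prune: cannot enter the top list
                    break
        else:
            top.append((dists[-1], i))
            top.sort(reverse=True)
            top = top[:n_out]
            if len(top) == n_out:
                cutoff = top[-1][0]            # score of the weakest top outlier
    return top
```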

682 citations


Journal ArticleDOI
TL;DR: The theory of Robust Subspace Learning (RSL) for linear models within a continuous optimization framework based on robust M-estimation is developed and applies to a variety of linear learning problems in computer vision including eigen-analysis and structure from motion.
Abstract: Many computer vision, signal processing and statistical problems can be posed as problems of learning low-dimensional linear or multi-linear models. These models have been widely used for the representation of shape, appearance, motion, etc., in computer vision applications. Methods for learning linear models can be seen as a special case of subspace fitting. One drawback of previous learning methods is that they are based on least squares estimation techniques and hence fail to account for “outliers” which are common in realistic training sets. We review previous approaches for making linear learning methods robust to outliers and present a new method that uses an intra-sample outlier process to account for pixel outliers. We develop the theory of Robust Subspace Learning (RSL) for linear models within a continuous optimization framework based on robust M-estimation. The framework applies to a variety of linear learning problems in computer vision including eigen-analysis and structure from motion. Several synthetic and natural examples are used to develop and illustrate the theory and applications of robust subspace learning in computer vision.
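The following is a highly simplified IRLS-style sketch of robust subspace fitting with sample-level Huber weights; the paper's RSL formulation works at the pixel (intra-sample) level with an explicit outlier process, so this only conveys the down-weighting idea:

```python
import numpy as np

def irls_robust_subspace(X, n_components=2, n_iter=25, c=2.0):
    """Alternate a weighted PCA with Huber-style weights on each sample's
    reconstruction residual, so outlying samples are progressively down-weighted."""
    w = np.ones(len(X))
    for _ in range(n_iter):
        mu = np.average(X, axis=0, weights=w)
        _, _, vt = np.linalg.svd((X - mu) * np.sqrt(w)[:, None], full_matrices=False)
        B = vt[:n_components]                           # orthonormal basis of the subspace
        resid = np.linalg.norm((X - mu) - (X - mu) @ B.T @ B, axis=1)
        scale = np.median(resid) / 0.6745 + 1e-12       # robust scale estimate
        ratio = resid / scale
        w = np.where(ratio <= c, 1.0, c / ratio)        # Huber weights
    return mu, B, w                                      # low weights point at outliers
```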

673 citations


Proceedings Article
01 Jan 2003
TL;DR: A novel scheme is proposed that uses a robust principal component classifier for intrusion detection problems where the training data may be unsupervised; it outperforms the nearest neighbor method, the density-based local outlier (LOF) approach, and the outlier detection algorithm based on the Canberra metric.
Abstract: This paper proposes a novel scheme that uses a robust principal component classifier in intrusion detection problems where the training data may be unsupervised. Assuming that anomalies can be treated as outliers, an intrusion predictive model is constructed from the major and minor principal components of the normal instances. A measure of the difference of an anomaly from the normal instance is the distance in the principal component space. The distance based on the major components that account for 50% of the total variation and the minor components whose eigenvalues are less than 0.20 is shown to work well. The experiments with KDD Cup 1999 data demonstrate that the proposed method achieves 98.94% in recall and 97.89% in precision with a false alarm rate of 0.92% and outperforms the nearest neighbor method, the density-based local outliers (LOF) approach, and the outlier detection algorithm based on the Canberra metric.
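A sketch of the scoring idea, using classical (non-robust) PCA rather than the robust estimator of the paper, might look like the following; the 50% variance and 0.20 eigenvalue cut-offs follow the abstract, everything else is an illustrative assumption:

```python
import numpy as np

def pca_anomaly_scores(train_normal, test, major_var=0.50, minor_eig=0.20):
    """Score test samples by their distance in principal-component space:
    sum of squared standardised PC scores over the major components
    (those explaining ~50% of the variance) and over the minor components
    (those with eigenvalue below ~0.20)."""
    mu, sd = train_normal.mean(0), train_normal.std(0) + 1e-12
    Z = (train_normal - mu) / sd
    eigval, eigvec = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigval)[::-1]                  # sort components by variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    cum = np.cumsum(eigval) / eigval.sum()
    q = int(np.searchsorted(cum, major_var)) + 1      # number of major components
    minor = eigval < minor_eig                        # mask of minor components
    Y = ((test - mu) / sd) @ eigvec                   # scores in PC space
    major_score = (Y[:, :q] ** 2 / eigval[:q]).sum(axis=1)
    minor_score = (Y[:, minor] ** 2 / eigval[minor]).sum(axis=1)
    return major_score, minor_score                   # threshold each to flag anomalies
```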

574 citations


Journal ArticleDOI
TL;DR: This paper summarizes the main results of Cazals et al. (2002) on robust nonparametric frontier estimators, proposes a methodology implementing the tool, and shows how it can be used for detecting outliers when using the classical DEA/FDH estimators or any parametric technique.
Abstract: In frontier analysis, most of the nonparametric approaches (DEA, FDH) are based on envelopment ideas which suppose that, with probability one, all the observed units belong to the attainable set. In these "deterministic" frontier models, statistical theory is now mostly available (Simar and Wilson, 2000a). In the presence of superefficient outliers, envelopment estimators could behave dramatically since they are very sensitive to extreme observations. Some recent results from Cazals et al. (2002) on robust nonparametric frontier estimators may be used in order to detect outliers by defining a new DEA/FDH "deterministic" type estimator which does not envelop all the data points and so is more robust to extreme data points. In this paper, we summarize the main results of Cazals et al. (2002) and we show how this tool can be used for detecting outliers when using the classical DEA/FDH estimators or any parametric technique. We propose a methodology implementing the tool and we illustrate it through some numerical examples with simulated and real data. The method should be used in a first step, as an exploratory data analysis, before using any frontier estimation.

356 citations


Proceedings ArticleDOI
20 Jul 2003
TL;DR: A new algorithm for time-series novelty detection based on one-class support vector machines (SVMs) is proposed and a technique to combine intermediate results at different phase spaces is proposed in order to obtain robust detection results.
Abstract: Time-series novelty detection, or anomaly detection, refers to the automatic identification of novel or abnormal events embedded in normal time-series points. Although it is a challenging topic in data mining, it has been attracting increasing attention due to its huge potential for immediate applications. In this paper, a new algorithm for time-series novelty detection based on one-class support vector machines (SVMs) is proposed. The concepts of phase and projected phase spaces are first introduced, which allow us to convert a time series into a set of vectors in the (projected) phase spaces. Then we interpret novel events in the time series as outliers of the "normal" distribution of the converted vectors in the (projected) phase spaces. One-class SVMs are employed as the outlier detectors. In order to obtain robust detection results, a technique to combine intermediate results at different phase spaces is also proposed. Experiments on both synthetic and measured data are presented to demonstrate the promising performance of the new algorithm.
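A minimal sketch of the pipeline, using scikit-learn's OneClassSVM and a single time-delay (phase-space) embedding rather than the paper's combination over several projected phase spaces; the embedding dimension, lag and nu are illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

def phase_space_embed(series, dim=4, lag=2):
    """Time-delay embedding: each length-`dim` lagged window becomes one vector."""
    n = len(series) - (dim - 1) * lag
    return np.array([series[i:i + dim * lag:lag] for i in range(n)])

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.05 * rng.normal(size=1000)
x[500:505] += 2.0                                    # inject a short abnormal event

V = phase_space_embed(x)                             # vectors in the phase space
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01).fit(V)
novel = np.where(detector.predict(V) == -1)[0]       # -1 marks outlier vectors
print("windows flagged as novel:", novel)
```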

317 citations


Journal ArticleDOI
TL;DR: A new algorithm for the independent components analysis (ICA) problem based on an efficient entropy estimator that is simple, computationally efficient, intuitively appealing, and outperforms other well known algorithms.
Abstract: This paper presents a new algorithm for the independent components analysis (ICA) problem based on an efficient entropy estimator. Like many previous methods, this algorithm directly minimizes the measure of departure from independence according to the estimated Kullback-Leibler divergence between the joint distribution and the product of the marginal distributions. We pair this approach with efficient entropy estimators from the statistics literature. In particular, the entropy estimator we use is consistent and exhibits rapid convergence. The algorithm based on this estimator is simple, computationally efficient, intuitively appealing, and outperforms other well known algorithms. In addition, the estimator's relative insensitivity to outliers translates into superior performance by our ICA algorithm on outlier tests. We present favorable comparisons to the Kernel ICA, FAST-ICA, JADE, and extended Infomax algorithms in extensive simulations. We also provide public domain source code for our algorithms.

283 citations


Journal ArticleDOI
TL;DR: In this paper, a structural health monitoring methodology for a wing box is presented. Butler et al. used novelty detection based on measured transmissibilities from the structure of the wing.

237 citations


Journal ArticleDOI
TL;DR: A general definition of S-outliers for spatial outliers is provided and the computation structure of spatial outlier detection methods is characterized and scalable algorithms are presented.
Abstract: Spatial outliers represent locations which are significantly different from their neighborhoods even though they may not be significantly different from the entire population. Identification of spatial outliers can lead to the discovery of unexpected, interesting, and implicit knowledge, such as local instability. In this paper, we first provide a general definition of S-outliers for spatial outliers. This definition subsumes the traditional definitions of spatial outliers. Second, we characterize the computation structure of spatial outlier detection methods and present scalable algorithms. Third, we provide a cost model of the proposed algorithms. Finally, we experimentally evaluate our algorithms using a Minneapolis-St. Paul (Twin Cities) traffic data set.
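In the same spirit, a minimal neighbourhood-difference test (not the paper's exact S-outlier statistic or its scalable algorithms) can be sketched as follows; the neighbour structure and the z-score threshold are up to the user:

```python
import numpy as np

def spatial_z_scores(values, neighbors):
    """For each location i, compare its attribute with the mean over neighbors[i]
    (a list of neighbouring indices), then standardise the differences; |z| above
    roughly 2-3 is a common flag for a spatial outlier."""
    diffs = np.array([values[i] - np.mean([values[j] for j in neighbors[i]])
                      for i in range(len(values))])
    return (diffs - diffs.mean()) / diffs.std()

# toy 1-D example: five locations in a row, each neighbouring the adjacent ones
vals = np.array([1.0, 1.1, 9.0, 1.2, 0.9])
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(spatial_z_scores(vals, nbrs))                  # location 2 stands out
```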

Proceedings ArticleDOI
19 Nov 2003
TL;DR: This work formulates the spatial outlier detection problem in a general way, designs algorithms that can accurately detect spatial outliers, and demonstrates that these approaches can not only avoid detecting false spatial outliers but also find true spatial outliers ignored by existing methods.
Abstract: A spatial outlier is a spatially referenced object whose non-spatial attribute values are significantly different from the values of its neighborhood. Identification of spatial outliers can lead to the discovery of unexpected, interesting, and useful spatial patterns for further analysis. One drawback of existing methods is that normal objects tend to be falsely detected as spatial outliers when their neighborhood contains true spatial outliers. We propose a suite of spatial outlier detection algorithms to overcome this disadvantage. We formulate the spatial outlier detection problem in a general way and design algorithms which can accurately detect spatial outliers. In addition, using a real-world census data set, we demonstrate that our approaches can not only avoid detecting false spatial outliers but also find true spatial outliers ignored by existing methods.

Journal ArticleDOI
TL;DR: Closest distance to the center (CDC) is proposed in this paper as an alternative for outlier detection; better performance was obtained when CDC was incorporated with MVT, compared to using CDC or MVT alone.

Journal ArticleDOI
TL;DR: A simple but useful statistical model of the room transfer function is developed for analyzing acoustical source localization methods when room reverberation is present, and the so-called PHAT time-delay estimator is shown to be optimal among a class of cross-correlation-based time-delay estimators.
Abstract: Room reverberation is typically the main obstacle for designing robust microphone-based source localization systems. The purpose of the paper is to analyze the achievable performance of acoustical source localization methods when room reverberation is present. To facilitate the analysis, we apply well known results from room acoustics to develop a simple but useful statistical model for the room transfer function. The properties of the statistical model are found to correlate well with results from real data measurements. The room transfer function model is further applied to analyze the statistical properties of some existing methods for source localization. In this respect we consider especially the asymptotic error variance and the probability of an anomalous estimate. A noteworthy outcome of the analysis is that the so-called PHAT time-delay estimator is shown to be optimal among a class of cross-correlation based time-delay estimators. To verify our results on the error variance and the outlier probability we apply the image method for simulation of the room transfer function.
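For reference, a standard GCC-PHAT time-delay estimator, the estimator the paper analyses, can be sketched as follows (no interpolation or lag limiting, which practical implementations usually add):

```python
import numpy as np

def gcc_phat_delay(x, y, fs):
    """PHAT-weighted generalised cross-correlation: whiten the cross-spectrum so
    only phase is kept, then take the peak lag.  Returns the delay of x relative
    to the reference y, in seconds (positive when x arrives later)."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = X * np.conj(Y)
    R /= np.maximum(np.abs(R), 1e-12)                # PHAT weighting
    cc = np.fft.irfft(R, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# toy check: a copy of a noise signal delayed by 25 samples
fs = 8000
rng = np.random.default_rng(0)
s = rng.normal(size=4096)
x = np.concatenate((np.zeros(25), s))[:4096]
print(gcc_phat_delay(x, s, fs) * fs)                 # approximately 25
```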

Journal ArticleDOI
TL;DR: A robust analysis method is developed for the understanding of large-scale shifts in gene effects and the isolation of particular sample-by-gene effects that might be either unusual interactions or the result of experimental flaws.
Abstract: In microarray data there are a number of biological samples, each assessed for the level of gene expression for a typically large number of genes. There is a need to examine these data with statistical techniques to help discern possible patterns in the data. Our technique applies a combination of mathematical and statistical methods to progressively take the data set apart so that different aspects can be examined for both general patterns and very specific effects. Unfortunately, these data tables are often corrupted with extreme values (outliers), missing values, and non-normal distributions that preclude standard analysis. We develop a robust analysis method to address these problems. The benefits of this robust analysis will be both the understanding of large-scale shifts in gene effects and the isolation of particular sample-by-gene effects that might be either unusual interactions or the result of experimental flaws. Our method requires a single pass and does not resort to complex "cleaning" or imputation of the data table before analysis. We illustrate the method with a commercial data set.

Journal ArticleDOI
TL;DR: It is shown that population trimmed L-moments assign zero weight to extreme observations, they are easy to compute, their sample variances and covariances can be obtained in closed form, and they are more robust to the presence of outliers than L-moments.

Book
21 Oct 2003
TL;DR: This book presents a meta-modelling framework for estimating the level of uncertainty in the results of cDNA Microarray experiments, as well as some of the techniques used to assess the quality of these experiments.
Abstract: Preface.1 A Brief Introduction.1.1 A Note on Exploratory Data Analysis.1.2 Computing Considerations and Software.1.3 A Brief Outline of the Book.2 Genomics Basics.2.1 Genes.2.2 DNA.2.3 Gene Expression.2.4 Hybridization Assays and Other Laboratory Techniques.2.5 The Human Genome.2.6 Genome Variations and Their Consequences.2.7 Genomics.2.8 The Role of Genomics in Pharmaceutical Research.2.9 Proteins.2.10 Bioinformatics.Supplementary Reading.Exercises.3 Microarrays.3.1 Types of Microarray Experiments.3.1.1 Experiment Type 1: Tissue-Specific Gene Expression.3.1.2 Experiment Type 2: Developmental Genetics.3.1.3 Experiment Type 3: Genetic Diseases.3.1.4 Experiment Type 4: Complex Diseases.3.1.5 Experiment Type 5: Pharmacological Agents.3.1.6 Experiment Type 6: Plant Breeding.3.1.7 Experiment Type 7: Environmental Monitoring.3.2 A Very Simple Hypothetical Microarray Experiment.3.3 A Typical Microarray Experiment.3.3.1 Microarray Preparation.3.3.2 Sample Preparation.3.3.3 The Hybridization Step.3.3.4 Scanning the Microarray.3.3.5 Interpreting the Scanned Image.3.4 Multichannel cDNA Microarrays.3.5 Oligonucleotide Arrays.3.6 Bead-Based Arrays.3.7 Confirmation of Microarray Results.Supplementary Reading and Electronic References.Exercises.4 Processing the Scanned Image.4.1 Converting the Scanned Image to the Spotted Image.4.1.1 Gridding.4.1.2 Segmentation.4.1.3 Quantification.4.2 Quality Assessment.4.2.1 Visualizing the Spotted Image.4.2.2 Numerical Evaluation of Array Quality.4.2.3 Spatial Problems.4.2.4 Spatial Randomness.4.2.5 Quality Control of Arrays.4.2.6 Assessment of Spot Quality.4.3 Adjusting for Background.4.3.1 Estimating the Background.4.3.2 Adjusting for the Estimated Background.4.4 Expression Level Calculation for Two-Channel cDNA Microarrays.4.5 Expression Level Calculation for Oligonucleotide Arrays.4.5.1 The Average Difference.4.5.2 A Weighted Average Difference.4.5.3 Perfect Matches Only.4.5.4 Background Adjustment Approach.4.5.5 Model-Based Approach.4.5.6 Absent-Present Calls.Supplementary Reading.Exercises.5 Preprocessing Microarray Data.5.1 Logarithmic Transformation.5.2 Variance Stabilizing Transformations.5.3 Sources of Bias.5.4 Normalization.5.5 Intensity-Dependent Normalization.5.5.1 Smooth Function Normalization.5.5.2 Quantile Normalization.5.5.3 Normalization of Oligonucleotide Arrays.5.5.4 Normalization of Two-Channel Arrays.5.5.5 Spatial Normalization.5.5.6 Stagewise Normalization.5.6 Judging the Success of a Normalization.5.7 Outlier Identification.5.7.1 Nonresistant Rules for Outlier Identification.5.7.2 Resistant Rules for Outlier Identification.5.8 Assessing Replicate Array Quality.Exercises.6 Summarization.6.1 Replication.6.2 Technical Replicates.6.3 Biological Replicates.6.4 Experiments with Both Technical and Biological Replicates.6.5 Multiple Oligonucleotide Arrays.6.6 Estimating Fold Change in Two-Channel Experiments.6.7 Bayes Estimation of Fold Change.Exercises.7 Two-Group Comparative Experiments.7.1 Basics of Statistical Hypothesis Testing.7.2 Fold Changes.7.3 The Two-Sample t Test.7.4 Diagnostic Checks.7.5 Robust t Tests.7.6 Randomization Tests.7.7 The Mann-Whitney-Wilcoxon Rank Sum Test.7.8 Multiplicity.7.8.1 A Pragmatic Approach to the Issue of Multiplicity.7.8.2 Simple Multiplicity Adjustments.7.8.3 Sequential Multiplicity Adjustments.7.9 The False Discovery Rate.7.9.1 The Positive False Discovery Rate.7.10 Small Variance-Adjusted t Tests and SAM.7.10.1 Modifying the t Statistic.7.10.2 Assesing Significance with the SAM t Statistic.7.10.3 
Strategies for Using SAM.7.10.4 An Empirical Bayes Framework.7.10.5 Understanding the SAM Adjustment.7.11 Conditional t.7.12 Borrowing Strength across Genes.7.12.1 Simple Methods.7.12.2 A Bayesian Model.7.13 Two-Channel Experiments.7.13.1 The Paired Sample t Test and SAM.7.13.2 Borrowing Strength via Hierarchical Modeling.Supplementary Reading.Exercises.8 Model-Based Inference and Experimental Design Considerations.8.1 The F Test.8.2 The Basic Linear Model.8.3 Fitting the Model in Two Stages.8.4 Multichannel Experiments.8.5 Experimental Design Considerations.8.5.1 Comparing Two Varieties with Two-Channel Microarrays.8.5.2 Comparing Multiple Varieties with Two-Channel Microarrays.8.5.3 Single-Channel Microarray Experiments.8.6 Miscellaneous Issues.Supplementary Reading.Exercises.9 Pattern Discovery.9.1 Initial Considerations.9.2 Cluster Analysis.9.2.1 Dissimilarity Measures and Similarity Measures.9.2.2 Guilt by Association.9.2.3 Hierarchical Clustering.9.2.4 Partitioning Methods.9.2.5 Model-Based Clustering.9.2.6 Chinese Restaurant Clustering.9.2.7 Discussion.9.3 Seeking Patterns Visually.9.3.1 Principal Components Analysis.9.3.2 Factor Analysis.9.3.3 Biplots.9.3.4 Spectral Map Analysis.9.3.5 Multidimensional Scaling.9.3.6 Projection Pursuit.9.3.7 Data Visualization with the Grand Tour and Projection Pursuit.9.4 Two-Way Clustering.9.4.1 Block Clustering.9.4.2 Gene Shaving.9.4.3 The Plaid Model.Software Notes.Supplementary Reading.Exercises.10 Class Prediction.10.1 Initial Considerations.10.1.1 Misclassification Rates.10.1.2 Reducing the Number of Classifiers.10.2 Linear Discriminant Analysis.10.3 Extensions of Fisher's LDA.10.4 Nearest Neighbors.10.5 Recursive Partitioning.10.5.1 Classification Trees.10.5.2 Activity Region Finding.10.6 Neural Networks.10.7 Support Vector Machines.10.8 Integration of Genomic Information.10.8.1 Integration of Gene Expression Data and Molecular Structure Data.10.8.2 Pathway Inference.Software Notes.Supplementary Reading.Exercises.11 Protein Arrays.11.1 Introduction.11.2 Protein Array Experiments.11.3 Special Issues with Protein Arrays.11.4 Analysis.11.5 Using Antibody Antigen Arrays to Measure Protein Concentrations.Exercises.References.Author Index.Subject Index.

Journal ArticleDOI
TL;DR: The convergence rate of SVIRNs is faster than that of conventional networks with BP learning algorithms or with robust BP learning algorithms for interval regression analysis, and a traditional back-propagation (BP) learning algorithm can be used to adjust the initial structure networks of SVIRNs under training data sets with or without outliers.

Journal ArticleDOI
TL;DR: This letter argues that many visual scenes are based on a Manhattan three-dimensional grid that imposes regularities on the image statistics, and constructs a Bayesian model that implements this assumption and estimates the viewer orientation relative to the Manhattan grid.
Abstract: This letter argues that many visual scenes are based on a "Manhattan" three-dimensional grid that imposes regularities on the image statistics. We construct a Bayesian model that implements this assumption and estimates the viewer orientation relative to the Manhattan grid. For many images, these estimates are good approximations to the viewer orientation (as estimated manually by the authors). These estimates also make it easy to detect outlier structures that are unaligned to the grid. To determine the applicability of the Manhattan world model, we implement a null hypothesis model that assumes that the image statistics are independent of any three-dimensional scene structure. We then use the log-likelihood ratio test to determine whether an image satisfies the Manhattan world assumption. Our results show that if an image is estimated to be Manhattan, then the Bayesian model's estimates of viewer direction are almost always accurate (according to our manual estimates), and vice versa.

Proceedings ArticleDOI
19 Nov 2003
TL;DR: It is proved that additive Gaussian distribution is not a proper model for super-resolution noise and it is shown that Lp norm minimization results in a pixelwise weighted mean algorithm which requires the least possible amount of computation time and memory and produces a maximum likelihood solution.
Abstract: In the last two decades, many papers have been published, proposing a variety of methods for multi-frame resolution enhancement. These methods, which have a wide range of complexity, memory and time requirements, are usually very sensitive to their assumed model of data and noise, often limiting their utility. Different implementations of the non-iterative Shift and Add concept have been proposed as very fast and effective super-resolution algorithms. The paper of Elad & Hel-Or 2001 provided an adequate mathematical justification for the Shift and Add method for the simple case of an additive Gaussian noise model. In this paper we prove that an additive Gaussian distribution is not a proper model for super-resolution noise. Specifically, we show that Lp norm minimization (1≤p≤2) results in a pixelwise weighted mean algorithm which requires the least possible amount of computation time and memory and produces a maximum likelihood solution. We also justify the use of a robust prior information term based on the bilateral filter idea. Finally, for the underdetermined case, where the number of non-redundant low-resolution frames is less than the square of the resolution enhancement factor, we propose a method for detection and removal of outlier pixels. Our experiments using commercial digital cameras show that our proposed super-resolution method provides significant improvements in both accuracy and efficiency.
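A toy sketch of the fusion step consistent with this analysis is shown below: fusing co-located samples with a pixelwise median corresponds to the robust L1 choice, a mean to the L2 choice. Integer sub-pixel shifts and a known enhancement factor are assumed; this is not the authors' full pipeline (no deblurring, no outlier-pixel removal step):

```python
import numpy as np

def shift_and_add(frames, shifts, r, robust=True):
    """Place each low-resolution frame on an r-times finer grid at its integer
    sub-pixel shift (0 <= dy, dx < r), then fuse the samples landing on each
    high-resolution pixel with a pixelwise median (robust, L1) or mean (L2)."""
    h, w = frames[0].shape
    stacks = [[[] for _ in range(w * r)] for _ in range(h * r)]
    for frame, (dy, dx) in zip(frames, shifts):
        for i in range(h):
            for j in range(w):
                stacks[i * r + dy][j * r + dx].append(frame[i, j])
    fuse = np.median if robust else np.mean
    hr = np.zeros((h * r, w * r))
    for i in range(h * r):
        for j in range(w * r):
            if stacks[i][j]:
                hr[i, j] = fuse(stacks[i][j])
    return hr
```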

Journal ArticleDOI
TL;DR: This study applies the so-called jack-knife technique to PARAFAC in order to find the standard errors associated with the parameter estimates from the PARAFAC model, and shows the applicability of the method.

Journal ArticleDOI
TL;DR: A novel mixture model is proposed which treats as observed data not only the feature vector and the class label, but also the fact of label presence/absence for each sample, to address problems involving both the known, and unknown classes.
Abstract: Several authors have shown that, when labeled data are scarce, improved classifiers can be built by augmenting the training set with a large set of unlabeled examples and then performing suitable learning. These works assume each unlabeled sample originates from one of the (known) classes. Here, we assume each unlabeled sample comes from either a known or from a heretofore undiscovered class. We propose a novel mixture model which treats as observed data not only the feature vector and the class label, but also the fact of label presence/absence for each sample. Two types of mixture components are posited. "Predefined" components generate data from known classes and assume class labels are missing at random. "Nonpredefined" components only generate unlabeled data, i.e., they capture exclusively unlabeled subsets, consistent with an outlier distribution or new classes. The predefined/nonpredefined natures are data-driven, learned along with the other parameters via an extension of the EM algorithm. Our modeling framework addresses problems involving both the known and unknown classes: (1) robust classifier design, (2) classification with rejections, and (3) identification of the unlabeled samples (and their components) from unknown classes. Case 3 is a step toward new class discovery. Experiments are reported for each application, including topic discovery for the Reuters domain. Experiments also demonstrate the value of label presence/absence data in learning accurate mixtures.

Book ChapterDOI
01 Jan 2003
TL;DR: In this paper, the authors propose several new measures of skewness which are more robust against outlying values and compare their properties using both real and simulated data.
Abstract: Asymmetry of a univariate continuous distribution is commonly described as skewness. The well-known classical skewness coefficient is based on the first three moments of the data set, and hence it is strongly affected by the presence of one or more outliers. In this paper we propose several new measures of skewness which are more robust against outlying values. Their properties are compared using both real and simulated data.
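As a point of comparison, one long-established robust alternative to the moment coefficient is Bowley's quartile skewness, which depends only on the quartiles and so is insensitive to a few extreme values (the paper proposes further measures beyond this):

```python
import numpy as np

def quartile_skewness(x):
    """Bowley's quartile skewness ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1): depends only
    on the quartiles, so a few wild observations cannot dominate it the way they
    dominate the classical moment-based coefficient."""
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(size=1000), [50.0]])   # one gross outlier
print(quartile_skewness(sample))                           # stays close to 0
```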

Journal ArticleDOI
TL;DR: It is shown that Vogelsang's iterative method to detect outliers is incorrect, and an alternative method based on first-differenced data is proposed that has considerably more power and leads to unit root tests with more accurate finite-sample size and robustness to departures from a unit root.
Abstract: Recently, Vogelsang (1999) proposed a method to detect outliers which explicitly imposes the null hypothesis of a unit root. It works in an iterative fashion to select multiple outliers in a given series. We show, via simulations, that, under the null hypothesis of no outliers, it has the right size in finite samples to detect a single outlier but, when applied in an iterative fashion to select multiple outliers, it exhibits severe size distortions towards finding an excessive number of outliers. We show that his iterative method is incorrect and derive the appropriate limiting distribution of the test at each step of the search. Whether corrected or not, we also show that the outliers need to be very large for the method to have any decent power. We propose an alternative method based on first-differenced data that has considerably more power. We also show that our method to identify outliers leads to unit root tests with more accurate finite sample size and robustness to departures from a unit root. The issues are illustrated using two US/Finland real-exchange rate series.

Journal ArticleDOI
TL;DR: In this article, the authors present an algorithm that integrates image-feature tracking and 3D motion estimation into a closed loop, while detecting and rejecting outlier regions that do not fit the model.
Abstract: The problem of structure from motion is often decomposed into two steps: feature correspondence and three-dimensional reconstruction. This separation often causes gross errors when establishing correspondence fails. Therefore, we advocate the necessity to integrate visual information not only in time (i.e. across different views), but also in space, by matching regions --- rather than points --- using explicit photometric deformation models. We present an algorithm that integrates image-feature tracking and three-dimensional motion estimation into a closed loop, while detecting and rejecting outlier regions that do not fit the model. Due to occlusions and the causal nature of our algorithm, a drift in the estimates accumulates over time. We describe a method to perform global registration of local estimates of motion and structure by matching the appearance of feature regions stored over long time periods. We use image intensities to construct a score function that takes into account changes in brightness and contrast. Our algorithm is recursive and suitable for real-time implementation.

Proceedings ArticleDOI
16 Jul 2003
TL;DR: This paper focuses on the density-based notion that discovers local outliers by means of the local outlier factor (LOF) formulation, and introduces three enhancement schemes over LOF, namely LOF', LOF", and GridLOF.
Abstract: Outliers, commonly referred to as exceptional cases, exist in many real-world databases. Detection of such outliers is important for many applications. In this paper, we focus on the density-based notion that discovers local outliers by means of the local outlier factor (LOF) formulation. Three enhancement schemes over LOF are introduced, namely LOF', LOF", and GridLOF. Thorough explanation and analysis are given to demonstrate the abilities of LOF' in providing a simpler and more intuitive meaning of local outlier-ness; LOF" in handling cases where LOF fails to work appropriately; and GridLOF in improving the efficiency and accuracy.
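The baseline these schemes build on is plain LOF, which is available off the shelf; a minimal scikit-learn example is shown below (LOF', LOF" and GridLOF themselves are not part of the library):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)),            # dense cluster
               rng.uniform(-6, 6, (10, 2))])          # scattered points
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                            # -1 marks local outliers
scores = -lof.negative_outlier_factor_                 # larger => more outlying
print("flagged indices:", np.where(labels == -1)[0])
```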

Journal ArticleDOI
TL;DR: In this article, the authors propose a robust principal component regression (RPCR) method for the multivariate calibration model; classical PCR combines principal component analysis (PCA) on the regressors with least squares regression, and RPCR replaces both stages with robust counterparts.
Abstract: We consider the multivariate calibration model which assumes that the concentrations of several constituents of a sample are linearly related to its spectrum. Principal component regression (PCR) is widely used for the estimation of the regression parameters in this model. In the classical approach it combines principal component analysis (PCA) on the regressors with least squares regression. However, both stages yield very unreliable results when the data set contains outlying observations. We present a robust PCR (RPCR) method which also consists of two parts. First we apply a robust PCA method for high-dimensional data on the regressors, then we regress the response variables on the scores using a robust regression method. A robust RMSECV value and a robust R² value are proposed as exploratory tools to select the number of principal components. The prediction error is also estimated in a robust way. Moreover, we introduce several diagnostic plots which are helpful to visualize and classify the outliers. The robustness of RPCR is demonstrated through simulations and the analysis of a real data set.
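A rough stand-in for the two-stage idea, classical PCA scores followed by a robust (Huber) regression, is sketched below; the paper's RPCR additionally robustifies the PCA step, handles several responses, and uses robust RMSECV/R² diagnostics, all of which this sketch omits:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import HuberRegressor

def simple_robust_pcr(X, y, n_components=3):
    """Classical PCA for the scores, then a robust (Huber) regression of a single
    response y on those scores -- only the skeleton of the two-stage approach."""
    pca = PCA(n_components=n_components).fit(X)
    scores = pca.transform(X)
    reg = HuberRegressor().fit(scores, y)
    return pca, reg

# prediction for new spectra X_new: reg.predict(pca.transform(X_new))
```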

Journal ArticleDOI
TL;DR: The hidden logistic regression (HLS) model, as discussed by the authors, is a generalization of the LSTM model, where the unobservable true responses are comparable to a hidden layer in a feed-forward neural network.

Journal ArticleDOI
01 Dec 2003-Icarus
TL;DR: In this paper, the authors describe a method for carefully assessing the statistical performance of the various observatories that have produced asteroid astrometry, with the ultimate goal of using this statistical characterization to improve asteroid orbit determination.

Proceedings ArticleDOI
03 Nov 2003
TL;DR: This paper proposes two approaches to discover spatial outliers with multiple attributes, formulates the multi-attribute spatial outlier detection problem in a general way, provides two effective detection algorithms, and analyzes their computational complexity.
Abstract: A spatial outlier is a spatially referenced object whose non-spatial attribute values are significantly different from the values of its neighborhood. Identification of spatial outliers can lead to the discovery of unexpected, interesting, and useful spatial patterns for further analysis. Previous work in spatial outlier detection focuses on detecting spatial outliers with a single attribute. In this paper, we propose two approaches to discover spatial outliers with multiple attributes. We formulate the multi-attribute spatial outlier detection problem in a general way, provide two effective detection algorithms, and analyze their computational complexity. In addition, using a real-world census data set, we demonstrate that our approaches can effectively identify local abnormality in large spatial data sets.