
Showing papers on "Outlier published in 2012"


Journal ArticleDOI
TL;DR: In this article, a Bayesian maximum a posteriori (MAP) approach is presented, where a subset of highly correlated and quiet stars is used to generate a cotrending basis vector set, which is in turn used to establish a range of "reasonable" robust fit parameters.
Abstract: With the unprecedented photometric precision of the Kepler spacecraft, significant systematic and stochastic errors on transit signal levels are observable in the Kepler photometric data. These errors, which include discontinuities, outliers, systematic trends, and other instrumental signatures, obscure astrophysical signals. The presearch data conditioning (PDC) module of the Kepler data analysis pipeline tries to remove these errors while preserving planet transits and other astrophysically interesting signals. The completely new noise and stellar variability regime observed in Kepler data poses a significant problem to standard cotrending methods. Variable stars are often of particular astrophysical interest, so the preservation of their signals is of significant importance to the astrophysical community. We present a Bayesian maximum a posteriori (MAP) approach, where a subset of highly correlated and quiet stars is used to generate a cotrending basis vector set, which is in turn used to establish a range of "reasonable" robust fit parameters. These robust fit parameters are then used to generate a Bayesian prior and a Bayesian posterior probability distribution function (PDF) which, when maximized, finds the best fit that simultaneously removes systematic effects while reducing the signal distortion and noise injection that commonly afflicts simple least-squares (LS) fitting. A numerical and empirical approach is taken where the Bayesian prior PDFs are generated from fits to the light-curve distributions themselves.

721 citations
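
In its simplest Gaussian form, the MAP idea above reduces to a ridge-like regularized fit: the light curve is modeled as a linear combination of cotrending basis vectors, and a prior on the coefficients (derived from fits to quiet, highly correlated stars) pulls the solution away from the over-fitting of unconstrained least squares. The sketch below illustrates only that single step, not the PDC-MAP pipeline; the function name, the diagonal prior, and the scalar noise variance are our own assumptions.

```python
import numpy as np

def map_cotrend_fit(flux, basis, prior_mean, prior_var, noise_var):
    """Toy MAP fit of cotrending coefficients (illustrative, not PDC-MAP).

    Model: flux ~ N(basis @ theta, noise_var * I), theta ~ N(prior_mean, diag(prior_var)).
    With a Gaussian prior and likelihood, the MAP estimate is a regularized linear solve.
    """
    B = np.asarray(basis)                       # (n_cadences, n_basis_vectors)
    P_inv = np.diag(1.0 / np.asarray(prior_var))
    A = B.T @ B / noise_var + P_inv
    b = B.T @ flux / noise_var + P_inv @ prior_mean
    theta_map = np.linalg.solve(A, b)
    systematics = B @ theta_map
    return flux - systematics, theta_map        # corrected flux, fitted coefficients

# Plain least squares for comparison: theta_ls = np.linalg.lstsq(B, flux, rcond=None)[0].
# The prior keeps theta near values seen for quiet stars, limiting signal distortion.
```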


Journal ArticleDOI
TL;DR: This survey article discusses some important aspects of the ‘curse of dimensionality’ in detail and surveys specialized algorithms for outlier detection from both categories.
Abstract: High-dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are often indiscriminately subsumed under the term ‘curse of dimensionality’, more concrete aspects being the so-called ‘distance concentration effect’, the presence of irrelevant attributes concealing relevant information, or simply efficiency issues. In about just the last few years, the task of unsupervised outlier detection has found new specialized solutions for tackling high-dimensional data in Euclidean space. These approaches fall under mainly two categories, namely considering or not considering subspaces (subsets of attributes) for the definition of outliers. The former are specifically addressing the presence of irrelevant attributes, the latter do consider the presence of irrelevant attributes implicitly at best but are more concerned with general issues of efficiency and effectiveness. Nevertheless, both types of specialized outlier detection algorithms tackle challenges specific to high-dimensional data. In this survey article, we discuss some important aspects of the ‘curse of dimensionality’ in detail and survey specialized algorithms for outlier detection from both categories. © 2012 Wiley Periodicals, Inc.

699 citations


Journal ArticleDOI
TL;DR: In this article, a Bayesian Maximum A Posteriori (MAP) approach is presented where a subset of highly correlated and quiet stars is used to generate a cotrending basis vector set which is in turn used to establish a range of "reasonable" robust fit parameters.
Abstract: With the unprecedented photometric precision of the Kepler Spacecraft, significant systematic and stochastic errors on transit signal levels are observable in the Kepler photometric data. These errors, which include discontinuities, outliers, systematic trends and other instrumental signatures, obscure astrophysical signals. The Presearch Data Conditioning (PDC) module of the Kepler data analysis pipeline tries to remove these errors while preserving planet transits and other astrophysically interesting signals. The completely new noise and stellar variability regime observed in Kepler data poses a significant problem to standard cotrending methods such as SYSREM and TFA. Variable stars are often of particular astrophysical interest so the preservation of their signals is of significant importance to the astrophysical community. We present a Bayesian Maximum A Posteriori (MAP) approach where a subset of highly correlated and quiet stars is used to generate a cotrending basis vector set which is in turn used to establish a range of "reasonable" robust fit parameters. These robust fit parameters are then used to generate a Bayesian Prior and a Bayesian Posterior Probability Distribution Function (PDF) which when maximized finds the best fit that simultaneously removes systematic effects while reducing the signal distortion and noise injection which commonly afflicts simple least-squares (LS) fitting. A numerical and empirical approach is taken where the Bayesian Prior PDFs are generated from fits to the light curve distributions themselves.

520 citations


Journal ArticleDOI
TL;DR: In this paper, an efficient convex optimization-based algorithm that is called outlier pursuit is presented, which under some mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the exact optimal low-dimensional subspace and identifies the corrupted points.
Abstract: Singular-value decomposition (SVD) [and principal component analysis (PCA)] is one of the most widely used techniques for dimensionality reduction: successful and efficiently computable, it is nevertheless plagued by a well-known, well-documented sensitivity to outliers. Recent work has considered the setting where each point has a few arbitrarily corrupted components. Yet, in applications of SVD or PCA, such as robust collaborative filtering or bioinformatics, malicious agents, defective genes, or simply corrupted or contaminated experiments may effectively yield entire points that are completely corrupted. We present an efficient convex optimization-based algorithm that we call outlier pursuit, which under some mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the exact optimal low-dimensional subspace and identifies the corrupted points. Such identification of corrupted points that do not conform to the low-dimensional approximation is of paramount interest in bioinformatics, financial applications, and beyond. Our techniques involve matrix decomposition using nuclear norm minimization; however, our results, setup, and approach necessarily differ considerably from the existing line of work in matrix completion and matrix decomposition, since we develop an approach to recover the correct column space of the uncorrupted matrix, rather than the exact matrix itself. In any problem where one seeks to recover a structure rather than the exact initial matrices, techniques developed thus far relying on certificates of optimality will fail. We present an important extension of these methods, which allows the treatment of such problems.

388 citations
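
A rough way to see how such a decomposition can be computed is to solve a penalized variant, min over L and C of (1/2)||M − L − C||_F² + τ(||L||_* + λ||C||_{1,2}), by alternating the exact proximal steps of the two norms: singular-value thresholding for the nuclear norm and column-wise shrinkage for the l1,2 norm. The sketch below only illustrates that idea; it is not the paper's algorithm and carries none of its recovery guarantees, and all names and default parameters are ours.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def col_shrink(X, tau):
    """Column-wise soft thresholding: prox of tau * sum of column l2 norms."""
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    return X * np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)

def outlier_pursuit_sketch(M, lam=0.5, tau=1.0, n_iter=200):
    """Toy alternating-prox solver for a penalized low-rank + column-sparse split."""
    L = np.zeros_like(M)
    C = np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - C, tau)                 # exact minimizer over L
        C = col_shrink(M - L, tau * lam)    # exact minimizer over C
    outlier_cols = np.where(np.linalg.norm(C, axis=0) > 1e-6)[0]
    return L, C, outlier_cols               # subspace part, corruption, corrupted points
```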


Proceedings ArticleDOI
01 Apr 2012
TL;DR: A novel subspace search method that selects high contrast subspaces for density-based outlier ranking and proposes a first measure for the contrast of subspace dimensions to enhance the quality of traditional outlier rankings.
Abstract: Outlier mining is a major task in data analysis. Outliers are objects that highly deviate from regular objects in their local neighborhood. Density-based outlier ranking methods score each object based on its degree of deviation. In many applications, these ranking methods degenerate to random listings due to low contrast between outliers and regular objects. Outliers do not show up in the scattered full space; they are hidden in multiple high contrast subspace projections of the data. Measuring the contrast of such subspaces for outlier rankings is an open research challenge. In this work, we propose a novel subspace search method that selects high contrast subspaces for density-based outlier ranking. It is designed as a pre-processing step for outlier ranking algorithms. It searches for high contrast subspaces with a significant amount of conditional dependence among the subspace dimensions. With our approach, we propose a first measure for the contrast of subspaces. Thus, we enhance the quality of traditional outlier rankings by computing outlier scores in high contrast projections only. The evaluation on real and synthetic data shows that our approach outperforms traditional dimensionality reduction techniques, naive random projections as well as state-of-the-art subspace search techniques and provides enhanced quality for outlier ranking.

353 citations
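
The notion of subspace contrast can be approximated, under our own simplifying assumptions, by Monte-Carlo slicing: repeatedly condition one attribute of the subspace on random slices of the others and measure how far the conditional distribution deviates from the marginal, for example with a two-sample Kolmogorov-Smirnov statistic. The sketch below illustrates that idea only; it is not the authors' exact statistic or subspace search procedure, and all names and defaults are ours.

```python
import numpy as np
from scipy.stats import ks_2samp

def subspace_contrast(X, dims, n_draws=100, alpha=0.2, rng=None):
    """Rough Monte-Carlo contrast estimate for a subspace (illustrative only)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    per_dim_frac = alpha ** (1.0 / max(len(dims) - 1, 1))  # slice width per conditioning dim
    deviations = []
    for _ in range(n_draws):
        target = rng.choice(dims)              # dimension whose distribution we test
        mask = np.ones(n, dtype=bool)
        for d in dims:
            if d == target:
                continue
            width = int(per_dim_frac * n)      # keep a random contiguous slice in dim d
            start = rng.integers(0, max(n - width, 0) + 1)
            keep = np.argsort(X[:, d])[start:start + width]
            slice_mask = np.zeros(n, dtype=bool)
            slice_mask[keep] = True
            mask &= slice_mask
        if mask.sum() > 5:
            # deviation of conditional vs. marginal distribution of the target dimension
            deviations.append(ks_2samp(X[mask, target], X[:, target]).statistic)
    return float(np.mean(deviations)) if deviations else 0.0
```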


Journal ArticleDOI
TL;DR: The treatment concerns statistical robustness, which deals with deviations from the distributional assumptions, and addresses single and multichannel estimation problems as well as linear univariate regression for independently and identically distributed (i.i.d.) data.
Abstract: The word robust has been used in many contexts in signal processing. Our treatment concerns statistical robustness, which deals with deviations from the distributional assumptions. Many problems encountered in engineering practice rely on the Gaussian distribution of the data, which in many situations is well justified. This enables a simple derivation of optimal estimators. Nominal optimality, however, is useless if the estimator was derived under distributional assumptions on the noise and the signal that do not hold in practice. Even slight deviations from the assumed distribution may cause the estimator's performance to drastically degrade or to completely break down. The signal processing practitioner should, therefore, ask whether the performance of the derived estimator is acceptable in situations where the distributional assumptions do not hold. Isn't it robustness that is of a major concern for engineering practice? Many areas of engineering today show that the distribution of the measurements is far from Gaussian as it contains outliers, which cause the distribution to be heavy tailed. Under such scenarios, we address single and multichannel estimation problems as well as linear univariate regression for independently and identically distributed (i.i.d.) data. A rather extensive treatment of the important and challenging case of dependent data for the signal processing practitioner is also included. For these problems, a comparative analysis of the most important robust methods is carried out by evaluating their performance theoretically, using simulations as well as real-world data.

339 citations
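
As a concrete example of the statistical robustness discussed above, a Huber M-estimator of location downweights observations with large standardized residuals instead of either discarding them or trusting them fully. The following sketch uses iteratively reweighted averaging with an MAD scale estimate; it is a generic textbook illustration, not code from the article.

```python
import numpy as np

def huber_location(x, k=1.345, n_iter=50):
    """Robust location estimate via Huber's M-estimator (IRLS sketch).

    k is in units of the scale estimate; 1.345 gives ~95% efficiency at the Gaussian.
    """
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    scale = max(1.4826 * np.median(np.abs(x - mu)), 1e-12)  # MAD, Gaussian-consistent
    for _ in range(n_iter):
        r = (x - mu) / scale
        w = np.where(np.abs(r) <= k, 1.0, k / np.abs(r))    # Huber weights
        mu = np.sum(w * x) / np.sum(w)
    return mu

data = np.concatenate([np.random.normal(0.0, 1.0, 100), [50.0, 60.0]])  # two gross outliers
print(np.mean(data), huber_location(data))  # the mean is dragged upward, the M-estimate is not
```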


01 Jan 2012
TL;DR: A histogram-based outlier detection (HBOS) algorithm is presented, which scores records in linear time and assumes independence of the features, making it much faster than multivariate approaches at the cost of less precision.

Abstract: Unsupervised anomaly detection is the process of finding outliers in data sets without prior training. In this paper, a histogram-based outlier detection (HBOS) algorithm is presented, which scores records in linear time. It assumes independence of the features, making it much faster than multivariate approaches at the cost of less precision. A comparative evaluation on three UCI data sets against 10 standard algorithms shows that it can detect global outliers as reliably as state-of-the-art algorithms, but it performs poorly on local outlier problems. In our experiments, HBOS is up to 5 times faster than clustering-based algorithms and up to 7 times faster than nearest-neighbor-based methods.

312 citations
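
Under the stated independence assumption, an HBOS-style score can be sketched in a few lines: build one histogram per feature, normalize the bin heights, and sum each record's per-feature log inverse densities. The version below uses static-width bins and is only an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Histogram-based outlier score sketch (static-width bins, independent features).

    Score of a point = sum over features of log(1 / normalized bin density);
    points falling into sparse bins get large scores and are candidate outliers.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        hist = hist / hist.max()                       # normalize so the tallest bin is 1
        idx = np.clip(np.digitize(X[:, j], edges) - 1, 0, n_bins - 1)
        density = np.maximum(hist[idx], 1e-12)         # avoid log(0) for empty bins
        scores += np.log(1.0 / density)
    return scores
```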


Journal ArticleDOI
TL;DR: A robust recurrent neural network is presented in a Bayesian framework based on echo state mechanisms that is robust in the presence of outliers and is superior to existing methods.
Abstract: In this paper, a robust recurrent neural network is presented in a Bayesian framework based on echo state mechanisms. Since the new model is capable of handling outliers in the training data set, it is termed as a robust echo state network (RESN). The RESN inherits the basic idea of ESN learning in a Bayesian framework, but replaces the commonly used Gaussian distribution with a Laplace one, which is more robust to outliers, as the likelihood function of the model output. Moreover, the training of the RESN is facilitated by employing a bound optimization algorithm, based on which, a proper surrogate function is derived and the Laplace likelihood function is approximated by a Gaussian one, while remaining robust to outliers. It leads to an efficient method for estimating model parameters, which can be solved by using a Bayesian evidence procedure in a fully autonomous way. Experimental results show that the proposed method is robust in the presence of outliers and is superior to existing methods.

294 citations


Proceedings ArticleDOI
12 Aug 2012
TL;DR: This paper proposes a Robust Multi-Task Feature Learning algorithm (rMTFL) which simultaneously captures a common set of features among relevant tasks and identifies outlier tasks, and provides a detailed theoretical analysis on the proposed rMTFL formulation.
Abstract: Multi-task learning (MTL) aims to improve the performance of multiple related tasks by exploiting the intrinsic relationships among them. Recently, multi-task feature learning algorithms have received increasing attention and they have been successfully applied to many applications involving high dimensional data. However, they assume that all tasks share a common set of features, which is too restrictive and may not hold in real-world applications, since outlier tasks often exist. In this paper, we propose a Robust Multi-Task Feature Learning algorithm (rMTFL) which simultaneously captures a common set of features among relevant tasks and identifies outlier tasks. Specifically, we decompose the weight (model) matrix for all tasks into two components. We impose the well-known group Lasso penalty on row groups of the first component for capturing the shared features among relevant tasks. To simultaneously identify the outlier tasks, we impose the same group Lasso penalty but on column groups of the second component. We propose to employ the accelerated gradient descent to efficiently solve the optimization problem in rMTFL, and show that the proposed algorithm is scalable to large-size problems. In addition, we provide a detailed theoretical analysis on the proposed rMTFL formulation. Specifically, we present a theoretical bound to measure how well our proposed rMTFL approximates the true evaluation, and provide bounds to measure the error between the estimated weights of rMTFL and the underlying true weights. Moreover, by assuming that the underlying true weights are above the noise level, we present a sound theoretical result to show how to obtain the underlying true shared features and outlier tasks (sparsity patterns). Empirical studies on both synthetic and real-world data demonstrate that our proposed rMTFL is capable of simultaneously capturing shared features among tasks and identifying outlier tasks.

291 citations
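
The decomposition described above can be illustrated with a plain (non-accelerated) proximal-gradient sketch: one gradient step on a squared per-task loss of W = P + Q, followed by group soft-thresholding of the rows of P and the columns of Q. The code below is our own simplified rendering under a squared-loss assumption with a fixed step size; it is not the authors' accelerated algorithm, and all names are illustrative.

```python
import numpy as np

def row_group_shrink(M, tau):
    """Prox of tau * sum of row l2 norms (group Lasso over rows)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M * np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)

def rmtfl_sketch(Xs, ys, lam1, lam2, step=1e-3, n_iter=500):
    """Toy proximal-gradient solver for an rMTFL-style split W = P + Q.

    Row-wise group Lasso on P captures shared features; column-wise group Lasso
    on Q lets entire tasks deviate, flagging them as outlier tasks.
    """
    d, T = Xs[0].shape[1], len(Xs)
    P, Q = np.zeros((d, T)), np.zeros((d, T))
    for _ in range(n_iter):
        G = np.zeros((d, T))
        for t in range(T):
            w_t = P[:, t] + Q[:, t]
            G[:, t] = Xs[t].T @ (Xs[t] @ w_t - ys[t])          # gradient w.r.t. task t weights
        P = row_group_shrink(P - step * G, step * lam1)         # shrink rows of P
        Q = row_group_shrink((Q - step * G).T, step * lam2).T   # shrink columns of Q
    outlier_tasks = np.where(np.linalg.norm(Q, axis=0) > 1e-6)[0]
    return P, Q, outlier_tasks
```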


Journal ArticleDOI
TL;DR: In this paper, a robust rank correlation screening (RRCS) method is proposed to deal with ultra-high dimensional data, which is based on the Kendall correlation coefficient between response and predictor variables rather than the Pearson correlation.
Abstract: Independence screening is a variable selection method that uses a ranking criterion to select significant variables, particularly for statistical models with nonpolynomial dimensionality or “large $p$, small $n$” paradigms when $p$ can be as large as an exponential of the sample size $n$. In this paper we propose a robust rank correlation screening (RRCS) method to deal with ultra-high dimensional data. The new procedure is based on the Kendall $\tau$ correlation coefficient between response and predictor variables rather than the Pearson correlation of existing methods. The new method has four desirable features compared with existing independence screening methods. First, the sure independence screening property can hold only under the existence of a second order moment of predictor variables, rather than exponential tails or alikeness, even when the number of predictor variables grows as fast as exponentially of the sample size. Second, it can be used to deal with semiparametric models such as transformation regression models and single-index models under monotonic constraint to the link function without involving nonparametric estimation even when there are nonparametric functions in the models. Third, the procedure can be largely used against outliers and influence points in the observations. Last, the use of indicator functions in rank correlation screening greatly simplifies the theoretical derivation due to the boundedness of the resulting statistics, compared with previous studies on variable screening. Simulations are carried out for comparisons with existing methods and a real data example is analyzed.

237 citations
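
The screening step itself is simple to illustrate: compute the Kendall tau correlation between the response and each predictor, and keep the top-ranked predictors. The sketch below (the function name and the cutoff d are illustrative choices) shows that marginal ranking; the paper's contribution is the theory for when this retains all truly relevant variables.

```python
import numpy as np
from scipy.stats import kendalltau

def rank_correlation_screening(X, y, d):
    """Toy marginal screening in the spirit of RRCS: rank predictors by |Kendall tau|
    with the response and keep the top d. Only ranks enter the statistic, so the
    ranking is insensitive to outliers and monotone transformations."""
    taus = np.array([kendalltau(X[:, j], y)[0] for j in range(X.shape[1])])
    order = np.argsort(-np.abs(taus))
    return order[:d], taus

# Example cutoff often used in screening papers: d = n / log(n)
# keep, taus = rank_correlation_screening(X, y, d=int(len(y) / np.log(len(y))))
```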


Journal ArticleDOI
TL;DR: This paper proposes a robust method for recovering signals from 1-bit measurements using adaptive outlier pursuit that will detect the positions where sign flips happen and recover the signals using “correct” measurements.
Abstract: In compressive sensing (CS), the goal is to recover signals at reduced sample rate compared to the classic Shannon-Nyquist rate. However, the classic CS theory assumes the measurements to be real-valued and have infinite bit precision. The quantization of CS measurements has been studied recently and it has been shown that accurate and stable signal acquisition is possible even when each measurement is quantized to only one single bit. There are many algorithms proposed for 1-bit compressive sensing and they work well when there is no noise in the measurements, e.g., there are no sign flips, while the performance is worsened when there are a lot of sign flips in the measurements. In this paper, we propose a robust method for recovering signals from 1-bit measurements using adaptive outlier pursuit. This method will detect the positions where sign flips happen and recover the signals using “correct” measurements. Numerical experiments show the accuracy of sign flips detection and high performance of signal recovery for our algorithms compared with other algorithms.

Proceedings ArticleDOI
01 Dec 2012
TL;DR: A generalized view of evaluation methods is presented that allows both evaluating the performance of existing methods and comparing different methods w.r.t. their detection performance.
Abstract: Outlier detection research is currently focusing on the development of new methods and on improving the computation time for these methods. Evaluation however is rather heuristic, often considering just precision in the top k results or using the area under the ROC curve. These evaluation procedures do not allow for assessment of similarity between methods. Judging the similarity of or correlation between two rankings of outlier scores is an important question in itself but it is also an essential step towards meaningfully building outlier detection ensembles, where this aspect has been completely ignored so far. In this study, our generalized view of evaluation methods allows both to evaluate the performance of existing methods as well as to compare different methods w.r.t. their detection performance. Our new evaluation framework takes into consideration the class imbalance problem and offers new insights on similarity and redundancy of existing outlier detection methods. As a result, the design of effective ensemble methods for outlier detection is considerably enhanced.
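
Two of the quantities such an evaluation builds on, detection performance against ground truth and similarity between two score rankings, can be illustrated as follows. This is only a toy rendering of those ingredients, not the paper's evaluation framework; the function name is ours.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

def compare_outlier_rankings(scores_a, scores_b, labels):
    """Compare two outlier score vectors computed on the same data set.

    ROC AUC measures ranking-based detection performance against binary
    ground-truth labels (1 = outlier); the rank correlation between the two
    score vectors indicates how similar, and hence how redundant, the methods are.
    """
    return {
        "auc_a": roc_auc_score(labels, scores_a),
        "auc_b": roc_auc_score(labels, scores_b),
        "rank_correlation": spearmanr(scores_a, scores_b)[0],
    }
```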

Journal ArticleDOI
TL;DR: This work presents a computationally efficient frequentist approach to FMRI group analysis, which is termed mixed-effects multilevel analysis (MEMA), that incorporates both the variability across subjects and the precision estimate of each effect of interest from individual subject analyses.

Journal ArticleDOI
TL;DR: It is concluded that the incorporation of tools for outlier detection in WSNs can be based on current statistical methodology, which provides a usable and important tool in a novel scientific field.
Abstract: Wireless sensor network (WSN) applications require efficient, accurate and timely data analysis in order to facilitate near real-time critical decision-making and situation awareness. Accurate analysis and decision-making relies on the quality of WSN data as well as on the additional information and context. Raw observations collected from sensor nodes, however, may have low data quality and reliability due to limited WSN resources and harsh deployment environments. This article addresses the quality of WSN data focusing on outlier detection. These are defined as observations that do not conform to the expected behaviour of the data. The developed methodology is based on time-series analysis and geostatistics. Experiments with a real data set from the Swiss Alps showed that the developed methodology accurately detected outliers in WSN data taking advantage of their spatial and temporal correlations. It is concluded that the incorporation of tools for outlier detection in WSNs can be based on current statistical methodology. This provides a usable and important tool in a novel scientific field.

Proceedings ArticleDOI
12 Nov 2012
TL;DR: Nonlinear Kalman filter and Rauch-Tung-Striebel smoother type recursive estimators for nonlinear discrete-time state space models with multivariate Student's t-distributed measurement noise are presented.
Abstract: Nonlinear Kalman filter and Rauch-Tung-Striebel smoother type recursive estimators for nonlinear discrete-time state space models with multivariate Student's t-distributed measurement noise are presented. The methods approximate the posterior state at each time step using the variational Bayes method. The nonlinearities in the dynamic and measurement models are handled using the nonlinear Gaussian filtering and smoothing approach, which encompasses many known nonlinear Kalman-type filters. The method is compared to alternative methods in a computer simulation.

Posted Content
TL;DR: In this paper, the effects of different treatments of extreme observations on model estimation and on determining the number of spikes (outliers) are examined, and results for the estimation of the seasonal and stochastic components of electricity spot prices using either the original or the filtered data are compared.
Abstract: An important issue in fitting stochastic models to electricity spot prices is the estimation of a component to deal with trends and seasonality in the data. Unfortunately, estimation routines for the long-term and short-term seasonal pattern are usually quite sensitive to extreme observations, known as electricity price spikes. Improved robustness of the model can be achieved by (a) filtering the data with some reasonable procedure for outlier detection, and then (b) using estimation and testing procedures on the filtered data. In this paper we examine the effects of different treatment of extreme observations on model estimation and on determining the number of spikes (outliers). In particular we compare results for the estimation of the seasonal and stochastic components of electricity spot prices using either the original or filtered data. We find significant evidence for a superior estimation of both the seasonal short-term and long-term components when the data have been treated carefully for outliers. Overall, our findings point out the substantial impact the treatment of extreme observations may have on these issues and, therefore, also on the pricing of electricity derivatives like futures and option contracts. An added value of our study is the ranking of different filtering techniques used in the energy economics literature, suggesting which methods could be and which should not be used for spike identification.

01 Jan 2012
TL;DR: This paper attempts to bring together various outlier detection techniques, in a structured and generic description, to attain a better understanding of the different directions of research on outlier analysis for ourselves as well as for beginners in this research field.
Abstract: Outliers, once regarded as noisy data in statistics, have turned out to be an important problem that is being researched in diverse fields of research and application domains. Many outlier detection techniques have been developed specific to certain application domains, while some techniques are more generic. Some application domains are being researched in strict confidentiality, such as research on crime and terrorist activities; the techniques and their results are not readily forthcoming. A number of surveys, research and review articles and books cover outlier detection techniques in machine learning and statistical domains individually in great detail. In this paper we make an attempt to bring together various outlier detection techniques in a structured and generic description. With this exercise, we hope to attain a better understanding of the different directions of research on outlier analysis for ourselves as well as for beginners in this research field, who could then pick up the links to different areas of application in detail.

Journal ArticleDOI
TL;DR: The R package MixSim is a new tool that allows simulating mixtures of Gaussian distributions with different levels of overlap between mixture components, and can be readily employed to control the clustering complexity of datasets simulated from mixtures.
Abstract: The R package MixSim is a new tool that allows simulating mixtures of Gaussian distributions with different levels of overlap between mixture components. Pairwise overlap, defined as a sum of two misclassification probabilities, measures the degree of interaction between components and can be readily employed to control the clustering complexity of datasets simulated from mixtures. These datasets can then be used for systematic performance investigation of clustering and finite mixture modeling algorithms. Among other capabilities of MixSim, there are computing the exact overlap for Gaussian mixtures, simulating Gaussian and non-Gaussian data, simulating outliers and noise variables, calculating various measures of agreement between two partitionings, and constructing parallel distribution plots for the graphical display of finite mixture models. All features of the package are illustrated in great detail. The utility of the package is highlighted through a small comparison study of several popular clustering algorithms.

Journal ArticleDOI
TL;DR: The algorithm flowPeaks is automatic, fast, reliable, and robust to cluster shape and outliers; it has been compared with state-of-the-art algorithms, including Misty Mountain, FLOCK, flowMeans, flowMerge and FLAME.

Abstract: Motivation: For flow cytometry data, there are two common approaches to the unsupervised clustering problem: one is based on the finite mixture model and the other on spatial exploration of the histograms. The former is computationally slow and has difficulty identifying clusters of irregular shapes. The latter approach cannot be applied directly to high-dimensional data, as the computational time and memory become unmanageable and the estimated histogram is unreliable. An algorithm without these two problems would be very useful. Results: In this article, we combine ideas from the finite mixture model and histogram spatial exploration. This new algorithm, which we call flowPeaks, can be applied directly to high-dimensional data and identify irregular shape clusters. The algorithm first uses the K-means algorithm with a large K to partition the cell population into many small clusters. These partitioned data allow the generation of a smoothed density function using the finite mixture model. All local peaks are exhaustively searched by exploring the density function and the cells are clustered by the associated local peak. The algorithm flowPeaks is automatic, fast and reliable and robust to cluster shape and outliers. This algorithm has been applied to flow cytometry data and it has been compared with state of the art algorithms, including Misty Mountain, FLOCK, flowMeans, flowMerge and FLAME. Availability: The R package flowPeaks is available at https://github.com/yongchao/flowPeaks. Contact: yongchao.ge@mssm.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The RANSAC algorithm (RANdom SAmple Consensus) is a robust method to estimate parameters of a model fitting the data, in the presence of outliers among the data.

Abstract: The RANSAC [2] algorithm (RANdom SAmple Consensus) is a robust method to estimate parameters of a model fitting the data, in the presence of outliers among the data. Its random nature is due only to complexity considerations. It iteratively extracts a random sample out of all data, of minimal size sufficient to estimate the parameters. At each such trial, the number…
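
A minimal RANSAC loop for fitting a 2D line, written as an illustration of the general scheme described above; the parameter names, the inlier tolerance, and the final least-squares refit on the consensus set are our own choices, not the reference implementation.

```python
import numpy as np

def ransac_line(points, n_iter=1000, inlier_tol=1.0, rng=None):
    """Minimal RANSAC sketch for fitting y = a*x + b to an (N, 2) point array.

    Each trial fits the model to a minimal random sample (2 points) and counts
    how many points agree with it within inlier_tol; the best consensus set is
    refit by least squares at the end.
    """
    rng = np.random.default_rng(rng)
    x, y = points[:, 0], points[:, 1]
    best_inliers = None
    for _ in range(n_iter):
        i, j = rng.choice(len(points), size=2, replace=False)
        if x[i] == x[j]:
            continue                                   # degenerate minimal sample
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        inliers = np.abs(y - (a * x + b)) < inlier_tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    a, b = np.polyfit(x[best_inliers], y[best_inliers], 1)  # refit on the consensus set
    return a, b, best_inliers
```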

Book
21 Mar 2012
TL;DR: This book presents a model-based approach to survey inference, covering robust prediction under model misspecification, robust estimation of the prediction variance, and outlier-robust prediction, together with applications such as small area prediction and inference for distributions and quantiles.
Abstract: PART I: BASICS OF MODEL-BASED SURVEY INFERENCE 1. Introduction 2. The Model-Based Approach 3. Homogeneous Populations 4. Stratified Populations 5. Populations with Regression Structure 6. Clustered Populations 7. The General Linear Population Model PART II: ROBUST MODEL-BASED INFERENCE 8. Robust Prediction under Model Misspecification 9. Robust Estimation of the Prediction Variance 10. Outlier Robust Prediction PART III: APPLICATIONS OF MODEL-BASED SURVEY INFERENCE 11. Inference for Nonlinear Population Parameters 12. Survey Inference via Sub-Sampling 13. Estimation for Multipurpose Surveys 14. Inference for Domains 15. Prediction for Small Areas 16. Model-Based Inference for Distributions and Quantiles 17. Using Transformations in Sample Survey Inference Exercises

Journal ArticleDOI
TL;DR: In this article, the authors explored and compared the application of three different approaches to the data normalization problem in structural health monitoring (SHM), which concerns the removal of confounding trends induced by varying operational conditions from a measured structural response that correlates with damage.
Abstract: This paper explores and compares the application of three different approaches to the data normalization problem in structural health monitoring (SHM), which concerns the removal of confounding trends induced by varying operational conditions from a measured structural response that correlates with damage. The methodologies for singling out or creating damage-sensitive features that are insensitive to environmental influences explored here include cointegration, outlier analysis and an approach relying on principal component analysis. The application of cointegration is a new idea for SHM from the field of econometrics, and this is the first work in which it has been comprehensively applied to an SHM problem. Results when applying cointegration are compared with results from the more familiar outlier analysis and an approach that uses minor principal components. The ability of these methods for removing the effects of environmental/operational variations from damage-sensitive features is demonstrated and compared with benchmark data from the Brite-Euram project DAMASCOS (BE97 4213), which was collected from a Lamb-wave inspection of a composite panel subject to temperature variations in an environmental chamber.

Proceedings ArticleDOI
12 Aug 2012
TL;DR: A novel random projection-based technique that is able to estimate the angle-based outlier factor for all data points in time near-linear in the size of the data and introduces a theoretical analysis of the quality of approximation to guarantee the reliability of the estimation algorithm.
Abstract: Outlier mining in d-dimensional point sets is a fundamental and well studied data mining task due to its variety of applications. Most such applications arise in high-dimensional domains. A bottleneck of existing approaches is that implicit or explicit assessments on concepts of distance or nearest neighbor are deteriorated in high-dimensional data. Following up on the work of Kriegel et al. (KDD '08), we investigate the use of angle-based outlier factor in mining high-dimensional outliers. While their algorithm runs in cubic time (with a quadratic time heuristic), we propose a novel random projection-based technique that is able to estimate the angle-based outlier factor for all data points in time near-linear in the size of the data. Also, our approach is suitable to be performed in parallel environment to achieve a parallel speedup. We introduce a theoretical analysis of the quality of approximation to guarantee the reliability of our estimation algorithm. The empirical experiments on synthetic and real world data sets demonstrate that our approach is efficient and scalable to very large high-dimensional data sets.
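
For reference, the naive angle-based outlier factor that the paper accelerates can be written directly from its definition: for each point, the variance over all pairs of other points of the distance-weighted cosine between the corresponding difference vectors. The O(n³) sketch below is only meant to make the quantity concrete; it is not the proposed near-linear random-projection estimator.

```python
import numpy as np

def abof_scores(X):
    """Naive angle-based outlier factor, O(n^3), for illustration only.

    For point A, ABOF(A) = Var over pairs (B, C) of <AB, AC> / (|AB|^2 * |AC|^2).
    Outliers see the remaining points within a narrow angular range, so their
    variance, and hence their ABOF, is small: smaller score = more outlying.
    """
    n = X.shape[0]
    scores = np.empty(n)
    for a in range(n):
        vals = []
        for b in range(n):
            for c in range(b + 1, n):
                if a in (b, c):
                    continue
                ab, ac = X[b] - X[a], X[c] - X[a]
                vals.append((ab @ ac) / ((ab @ ab) * (ac @ ac)))
        scores[a] = np.var(vals)
    return scores
```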

Proceedings ArticleDOI
10 Dec 2012
TL;DR: A novel outlier detection model to find outliers that deviate from the generating mechanisms of normal instances by considering combinations of different subsets of attributes, as they occur when there are local correlations in the data set is proposed.
Abstract: In this paper, we propose a novel outlier detection model to find outliers that deviate from the generating mechanisms of normal instances by considering combinations of different subsets of attributes, as they occur when there are local correlations in the data set. Our model enables searching for outliers in arbitrarily oriented subspaces of the original feature space. We show how, in addition to an outlier score, our model also derives an explanation of the outlierness that is useful in investigating the results. Our experiments suggest that our novel method can find different outliers than existing work and can be seen as a complement to those approaches.

Patent
06 Nov 2012
TL;DR: In speech processing systems, compensation is made for sudden changes in the background noise in the average signal-to-noise ratio (SNR) calculation. SNR outlier filtering may be used, alone or in conjunction with weighting the average SNR, as discussed by the authors.
Abstract: In speech processing systems, compensation is made for sudden changes in the background noise in the average signal-to-noise ratio (SNR) calculation. SNR outlier filtering may be used, alone or in conjunction with weighting the average SNR. Adaptive weights may be applied on the SNRs per band before computing the average SNR. The weighting function can be a function of noise level, noise type, and/or instantaneous SNR value. Another weighting mechanism applies a null filtering or outlier filtering which sets the weight in a particular band to be zero. This particular band may be characterized as the one that exhibits an SNR that is several times higher than the SNRs in other bands.
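
A toy rendering of the band-weighting and null-filtering idea follows; the threshold, the use of the median as the reference level, and all names are our own illustrative choices, not the patent's specification.

```python
import numpy as np

def average_snr(snr_per_band, weights=None, outlier_factor=3.0):
    """Per-frame average SNR with outlier-band filtering (illustrative sketch).

    A band whose SNR is several times (outlier_factor) above the median of the
    bands gets weight zero (the null-filtering idea); the remaining bands are
    averaged with the supplied weights, which in practice could depend on noise
    level, noise type, or the instantaneous SNR itself.
    """
    snr = np.asarray(snr_per_band, dtype=float)
    w = np.ones_like(snr) if weights is None else np.asarray(weights, dtype=float)
    w = np.where(snr > outlier_factor * np.median(snr), 0.0, w)  # drop outlier bands
    return float(np.sum(w * snr) / np.sum(w))
```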

Proceedings ArticleDOI
12 Aug 2012
TL;DR: An integrated optimization framework is proposed which conducts outlier-aware community matching across snapshots and identification of evolutionary outliers in a tightly coupled way and a coordinate descent algorithm is proposed to improve community matching and outlier detection performance iteratively.
Abstract: Temporal datasets, in which data evolves continuously, exist in a wide variety of applications, and identifying anomalous or outlying objects from temporal datasets is an important and challenging task. Different from traditional outlier detection, which detects objects that have quite different behavior compared with the other objects, temporal outlier detection tries to identify objects that have different evolutionary behavior compared with other objects. Usually objects form multiple communities, and most of the objects belonging to the same community follow similar patterns of evolution. However, there are some objects which evolve in a very different way relative to other community members, and we define such objects as evolutionary community outliers. This definition represents a novel type of outliers considering both temporal dimension and community patterns. We investigate the problem of identifying evolutionary community outliers given the discovered communities from two snapshots of an evolving dataset. To tackle the challenges of community evolution and outlier detection, we propose an integrated optimization framework which conducts outlier-aware community matching across snapshots and identification of evolutionary outliers in a tightly coupled way. A coordinate descent algorithm is proposed to improve community matching and outlier detection performance iteratively. Experimental results on both synthetic and real datasets show that the proposed approach is highly effective in discovering interesting evolutionary community outliers.

Journal ArticleDOI
TL;DR: The developed outlier-aware PCA framework is versatile to accommodate novel and scalable algorithms to: i) track the low-rank signal subspace robustly, as new data are acquired in real time; and ii) determine principal components robustly in infinite-dimensional feature spaces.
Abstract: Principal component analysis (PCA) is widely used for dimensionality reduction, with well-documented merits in various applications involving high-dimensional data, including computer vision, preference measurement, and bioinformatics. In this context, the fresh look advocated here permeates benefits from variable selection and compressive sampling, to robustify PCA against outliers. A least-trimmed squares estimator of a low-rank bilinear factor analysis model is shown closely related to that obtained from an l0-(pseudo)norm-regularized criterion encouraging sparsity in a matrix explicitly modeling the outliers. This connection suggests robust PCA schemes based on convex relaxation, which lead naturally to a family of robust estimators encompassing Huber's optimal M-class as a special case. Outliers are identified by tuning a regularization parameter, which amounts to controlling sparsity of the outlier matrix along the whole robustification path of (group) least-absolute shrinkage and selection operator (Lasso) solutions. Beyond its ties to robust statistics, the developed outlier-aware PCA framework is versatile to accommodate novel and scalable algorithms to: i) track the low-rank signal subspace robustly, as new data are acquired in real time; and ii) determine principal components robustly in (possibly) infinite-dimensional feature spaces. Synthetic and real data tests corroborate the effectiveness of the proposed robust PCA schemes, when used to identify aberrant responses in personality assessment surveys, as well as unveil communities in social networks, and intruders from video surveillance data.

Proceedings ArticleDOI
10 Dec 2012
TL;DR: A robust NMF method based on the correntropy induced metric, which is much more insensitive to outliers is proposed, and a half-quadratic optimization algorithm is developed to solve the proposed problem efficiently.
Abstract: Nonnegative matrix factorization (NMF) is a popular technique for learning parts-based representation and data clustering. It usually uses the squared residuals to quantify the quality of factorization, which is optimal specifically to zero-mean, Gaussian noise and sensitive to outliers in general cases. In this paper, we propose a robust NMF method based on the correntropy induced metric, which is much more insensitive to outliers. A half-quadratic optimization algorithm is developed to solve the proposed problem efficiently. The proposed method is further extended to handle outlier rows by incorporating structural knowledge about the outliers. Experimental results on data sets with and without apparent outliers demonstrate the effectiveness of the proposed algorithms.
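
One common way to realize a correntropy-style loss with half-quadratic optimization is to alternate between fixing element-wise Welsch weights from the current residual and performing weighted multiplicative NMF updates: entries with large residuals receive small weights and thus barely influence the factors. The sketch below illustrates that generic scheme under our own assumptions (names, initialization, fixed kernel width); it is not the paper's exact algorithm.

```python
import numpy as np

def correntropy_nmf_sketch(X, rank, sigma=1.0, n_iter=200, rng=None):
    """Toy half-quadratic NMF with a correntropy-style (Welsch) loss."""
    rng = np.random.default_rng(rng)
    n, m = X.shape
    W = rng.random((n, rank)) + 0.1
    H = rng.random((rank, m)) + 0.1
    eps = 1e-12
    for _ in range(n_iter):
        R = X - W @ H
        A = np.exp(-(R ** 2) / (2.0 * sigma ** 2))          # half-quadratic weights
        WH = W @ H
        W *= ((A * X) @ H.T) / (((A * WH) @ H.T) + eps)     # weighted multiplicative update
        WH = W @ H
        H *= (W.T @ (A * X)) / ((W.T @ (A * WH)) + eps)
    return W, H
```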

Book ChapterDOI
01 Jan 2012
TL;DR: This chapter is about getting familiar with the data, which includes studying the various attribute types, which include nominal attributes, binary attributes, ordinal attributes, and numeric attributes.
Abstract: This chapter is about getting familiar with the data. Knowledge about the data is useful for data preprocessing, the first major task of the data mining process. The various attribute types are studied. These include nominal attributes, binary attributes, ordinal attributes, and numeric attributes. Basic statistical descriptions can be used to learn more about each attribute's values. Given a temperature attribute, one can determine its mean (average value), median (middle value), and mode (most common value). These are measures of central tendency, which give us an idea of the “middle” or center of distribution. Knowing such basic statistics regarding each attribute makes it easier to fill in missing values, smooth noisy values, and spot outliers during data preprocessing. Knowledge of the attributes and attribute values can also help in fixing inconsistencies incurred during data integration. Plotting the measures of central tendency shows us if the data are symmetric or skewed. Quantile plots, histograms, and scatter plots are other graphic displays of basic statistical descriptions. These can all be useful during data preprocessing and can provide insight into areas for mining. The field of data visualization provides many additional techniques for viewing data through graphical means. These can help identify relations, trends, and biases “hidden” in unstructured data sets. The similarity/dissimilarity between objects may also be used to detect outliers in the data, or to perform nearest-neighbor classification. There are many measures for assessing similarity and dissimilarity. In general, such measures are referred to as proximity measures.
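
The central-tendency measures mentioned above, together with a common rule of thumb for spotting outliers (values more than 1.5 times the interquartile range beyond the quartiles, the usual box-plot fence), can be illustrated on a toy temperature attribute; the data values here are made up.

```python
import numpy as np

temps = np.array([21.0, 22.5, 23.0, 23.0, 24.0, 25.5, 38.0])  # made-up temperature attribute

mean = temps.mean()                                # average value
median = np.median(temps)                          # middle value
values, counts = np.unique(temps, return_counts=True)
mode = values[np.argmax(counts)]                   # most common value

# Box-plot rule of thumb: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(temps, [25, 75])
iqr = q3 - q1
outliers = temps[(temps < q1 - 1.5 * iqr) | (temps > q3 + 1.5 * iqr)]
print(mean, median, mode, outliers)                # 38.0 is flagged as an outlier
```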

Journal ArticleDOI
TL;DR: It is demonstrated both theoretically and empirically that the resulting estimator is more efficient than the ordinary local polynomial regression (LPR) estimator in the presence of outliers or heavy-tail error distribution (such as t-distribution).
Abstract: A local modal estimation procedure is proposed for the regression function in a non-parametric regression model. A distinguishing characteristic of the proposed procedure is that it introduces an additional tuning parameter that is automatically selected using the observed data in order to achieve both robustness and efficiency of the resulting estimate. We demonstrate both theoretically and empirically that the resulting estimator is more efficient than the ordinary local polynomial regression estimator in the presence of outliers or heavy tail error distribution (such as t-distribution). Furthermore, we show that the proposed procedure is as asymptotically efficient as the local polynomial regression estimator when there are no outliers and the error distribution is a Gaussian distribution. We propose an EM type algorithm for the proposed estimation procedure. A Monte Carlo simulation study is conducted to examine the finite sample performance of the proposed method. The simulation results confirm the theoretical findings. The proposed methodology is further illustrated via an analysis of a real data example.