
Showing papers on "Outlier published in 2010"


Journal ArticleDOI
TL;DR: A probabilistic method, the Coherent Point Drift (CPD) algorithm, is introduced for both rigid and nonrigid point set registration, together with a fast variant that reduces the method's computational complexity to linear.
Abstract: Point set registration is a key component in many computer vision tasks. The goal of point set registration is to assign correspondences between two sets of points and to recover the transformation that maps one point set to the other. Multiple factors, including an unknown nonrigid spatial transformation, large dimensionality of the point sets, noise, and outliers, make point set registration a challenging problem. We introduce a probabilistic method, called the Coherent Point Drift (CPD) algorithm, for both rigid and nonrigid point set registration. We consider the alignment of two point sets as a probability density estimation problem. We fit the Gaussian mixture model (GMM) centroids (representing the first point set) to the data (the second point set) by maximizing the likelihood. We force the GMM centroids to move coherently as a group to preserve the topological structure of the point sets. In the rigid case, we impose the coherence constraint by reparameterizing the GMM centroid locations with rigid parameters and derive a closed-form solution of the maximization step of the EM algorithm in arbitrary dimensions. In the nonrigid case, we impose the coherence constraint by regularizing the displacement field and using variational calculus to derive the optimal transformation. We also introduce a fast algorithm that reduces the method's computational complexity to linear. We test the CPD algorithm for both rigid and nonrigid transformations in the presence of noise, outliers, and missing points, where CPD shows accurate results and outperforms current state-of-the-art methods.

2,429 citations
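To make the EM structure of this kind of registration concrete, here is a minimal Python sketch of rigid GMM-based alignment with a uniform outlier component, written in the spirit of CPD. It is a simplified illustration only (no scaling, no fast Gauss transform, and a plain variance update); the function name and defaults are our own, not the published implementation.

```python
import numpy as np

def rigid_gmm_register(X, Y, n_iter=50, w=0.1):
    """Toy EM registration in the spirit of rigid CPD (no scaling).
    X: (N, D) fixed points; Y: (M, D) moving GMM centroids.
    Illustrative simplification, not the published CPD algorithm."""
    N, D = X.shape
    M = Y.shape[0]
    R, t = np.eye(D), np.zeros(D)
    sigma2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum() / (D * M * N)
    for _ in range(n_iter):
        TY = Y @ R.T + t
        # E-step: responsibilities, with a uniform component of weight w absorbing outliers
        d2 = ((X[None, :, :] - TY[:, None, :]) ** 2).sum(-1)            # (M, N)
        num = np.exp(-d2 / (2.0 * sigma2))
        c = (2.0 * np.pi * sigma2) ** (D / 2.0) * (w / (1.0 - w)) * M / N
        P = num / (num.sum(axis=0, keepdims=True) + c)                   # (M, N)
        Np = P.sum()
        # M-step: weighted Procrustes gives the rigid parameters in closed form
        mu_x = (P.sum(axis=0) @ X) / Np
        mu_y = (P.sum(axis=1) @ Y) / Np
        Xh, Yh = X - mu_x, Y - mu_y
        A = Xh.T @ P.T @ Yh
        U, _, Vt = np.linalg.svd(A)
        C = np.eye(D)
        C[-1, -1] = np.linalg.det(U @ Vt)            # ensure a proper rotation
        R = U @ C @ Vt
        t = mu_x - R @ mu_y
        # update the isotropic variance using the new transform
        d2 = ((X[None, :, :] - (Y @ R.T + t)[:, None, :]) ** 2).sum(-1)
        sigma2 = max((P * d2).sum() / (Np * D), 1e-8)
    return R, t, Y @ R.T + t
```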


Journal ArticleDOI
TL;DR: Standard diagnostic procedures developed for linear regression analyses are extended to the meta-analytic fixed- and random/mixed-effects models, and three examples illustrate the usefulness of these procedures in various research settings.
Abstract: The presence of outliers and influential cases may affect the validity and robustness of the conclusions from a meta-analysis. While researchers generally agree that it is necessary to examine outlier and influential case diagnostics when conducting a meta-analysis, limited studies have addressed how to obtain such diagnostic measures in the context of a meta-analysis. The present paper extends standard diagnostic procedures developed for linear regression analyses to the meta-analytic fixed- and random/mixed-effects models. Three examples are used to illustrate the usefulness of these procedures in various research settings. Issues related to these diagnostic procedures in meta-analysis are also discussed. Copyright © 2010 John Wiley & Sons, Ltd.

1,335 citations
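For readers who want a feel for what such case diagnostics compute, the sketch below evaluates leave-one-out (deleted) residuals and the influence of each study on the pooled estimate under a simple fixed-effect model with inverse-variance weights. It is a generic illustration with hypothetical names, not the authors' formulas, which also cover random/mixed-effects models.

```python
import numpy as np

def fixed_effect_diagnostics(y, v):
    """Leave-one-out case diagnostics for a fixed-effect meta-analysis.
    y: observed effect sizes; v: their sampling variances."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v
    mu = np.sum(w * y) / np.sum(w)                      # pooled estimate
    out = []
    for i in range(len(y)):
        keep = np.ones(len(y), bool)
        keep[i] = False
        mu_i = np.sum(w[keep] * y[keep]) / np.sum(w[keep])   # estimate without study i
        var_mu_i = 1.0 / np.sum(w[keep])
        deleted_resid = (y[i] - mu_i) / np.sqrt(v[i] + var_mu_i)
        influence = (mu - mu_i) / np.sqrt(var_mu_i)          # shift caused by study i
        out.append((deleted_resid, influence))
    return mu, np.array(out)

# Studies with |deleted residual| well above 2-3, or with a large influence value,
# are candidate outliers / influential cases.
```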


Journal ArticleDOI
TL;DR: This paper presents a method based on robust statistics to register images in the presence of differences, such as jaw movement, differential MR distortions and true anatomical change, which is highly accurate and shows superior robustness with respect to noise, to intensity scaling and outliers.

1,132 citations


Journal ArticleDOI
TL;DR: In this article, the authors provide a comprehensive overview of existing outlier detection techniques specifically developed for wireless sensor networks and present a technique-based taxonomy and a comparative table to be used as a guideline for selecting a technique suitable for the application at hand.
Abstract: In the field of wireless sensor networks, those measurements that significantly deviate from the normal pattern of sensed data are considered as outliers. The potential sources of outliers include noise and errors, events, and malicious attacks on the network. Traditional outlier detection techniques are not directly applicable to wireless sensor networks due to the nature of sensor data and the specific requirements and limitations of wireless sensor networks. This survey provides a comprehensive overview of existing outlier detection techniques specifically developed for wireless sensor networks. Additionally, it presents a technique-based taxonomy and a comparative table to be used as a guideline to select a technique suitable for the application at hand based on characteristics such as data type, outlier type, outlier identity, and outlier degree.

738 citations


Proceedings Article
06 Dec 2010
TL;DR: In this paper, an efficient convex optimization-based algorithm called Outlier Pursuit is presented, which under mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the exact optimal low-dimensional subspace, and identifies the corrupted points.
Abstract: Singular Value Decomposition (and Principal Component Analysis) is one of the most widely used techniques for dimensionality reduction: successful and efficiently computable, it is nevertheless plagued by a well-known, well-documented sensitivity to outliers. Recent work has considered the setting where each point has a few arbitrarily corrupted components. Yet, in applications of SVD or PCA such as robust collaborative filtering or bioinformatics, malicious agents, defective genes, or simply corrupted or contaminated experiments may effectively yield entire points that are completely corrupted. We present an efficient convex optimization-based algorithm we call Outlier Pursuit, which under some mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the exact optimal low-dimensional subspace and identifies the corrupted points. Such identification of corrupted points that do not conform to the low-dimensional approximation is of paramount interest in bioinformatics and financial applications, and beyond. Our techniques involve matrix decomposition using nuclear norm minimization; however, our results, setup, and approach necessarily differ considerably from the existing line of work in matrix completion and matrix decomposition, since we develop an approach to recover the correct column space of the uncorrupted matrix, rather than the exact matrix itself.

590 citations
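The decomposition idea can be illustrated with a small alternating-proximal sketch that splits a data matrix into a low-rank part plus a column-sparse part. Note that this solves a relaxed, unconstrained variant (chosen here for brevity), not the exact program analyzed in the paper, and the regularization weights are placeholders.

```python
import numpy as np

def outlier_pursuit_sketch(M, lam_nuc=1.0, lam_col=1.0, n_iter=200):
    """Alternate exact proximal updates for
        min_{L,C} 0.5*||M - L - C||_F^2 + lam_nuc*||L||_* + lam_col*sum_j ||C[:, j]||_2.
    Columns with nonzero C are flagged as corrupted points (outliers)."""
    L = np.zeros_like(M, dtype=float)
    C = np.zeros_like(M, dtype=float)
    for _ in range(n_iter):
        # L-update: singular value soft-thresholding of M - C
        U, s, Vt = np.linalg.svd(M - C, full_matrices=False)
        L = (U * np.maximum(s - lam_nuc, 0.0)) @ Vt
        # C-update: shrink whole columns of M - L toward zero (group soft-thresholding)
        R = M - L
        norms = np.linalg.norm(R, axis=0)
        scale = np.maximum(1.0 - lam_col / np.maximum(norms, 1e-12), 0.0)
        C = R * scale
    outlier_cols = np.where(np.linalg.norm(C, axis=0) > 1e-8)[0]
    return L, C, outlier_cols
```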


Journal ArticleDOI
TL;DR: In this paper, various techniques aimed at detecting potential outliers are reviewed and subdivided into two classes: those regarding univariate data and those addressing multivariate data.
Abstract: Outliers are observations or measures that are suspicious because they are much smaller or much larger than the vast majority of the observations. These observations are problematic because they may not be caused by the mental process under scrutiny or may not reflect the ability under examination. The problem is that a few outliers are sometimes enough to distort the group results (by altering the mean performance, by increasing variability, etc.). In this paper, various techniques aimed at detecting potential outliers are reviewed. These techniques are subdivided into two classes: those regarding univariate data and those addressing multivariate data. Within these two classes, we consider the cases where the population distribution is known to be normal, where it is non-normal but known, and where it is unknown. Recommendations are put forward in each case.

494 citations
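Two of the most common rules reviewed in this literature are easy to state in code: a robust z-score based on the median and MAD for univariate data, and Mahalanobis distances against a chi-square cutoff for multivariate data. The snippet below is a generic illustration of these rules, not a reproduction of the paper's specific recommendations.

```python
import numpy as np
from scipy import stats

def univariate_outliers(x, z_cut=3.0):
    """Robust z-score rule: |0.6745 * (x - median) / MAD| > z_cut."""
    x = np.asarray(x, float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    robust_z = 0.6745 * (x - med) / mad          # approx. N(0, 1) under normality
    return np.abs(robust_z) > z_cut

def multivariate_outliers(X, alpha=0.975):
    """Mahalanobis-distance rule against a chi-square quantile."""
    X = np.asarray(X, float)
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', X - mu, S_inv, X - mu)   # squared distances
    return d2 > stats.chi2.ppf(alpha, df=X.shape[1])
```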


Proceedings ArticleDOI
21 Jun 2010
TL;DR: This paper proposes a novel approach for estimating the egomotion of the vehicle from a sequence of stereo images which is directly based on the trifocal geometry between image triples, so no time-consuming recovery of the 3-dimensional scene structure is needed.
Abstract: A common prerequisite for many vision-based driver assistance systems is the knowledge of the vehicle's own movement. In this paper we propose a novel approach for estimating the egomotion of the vehicle from a sequence of stereo images. Our method is directly based on the trifocal geometry between image triples, so no time-consuming recovery of the 3-dimensional scene structure is needed. The only assumption we make is a known camera geometry, where the calibration may also vary over time. We employ an Iterated Sigma Point Kalman Filter in combination with a RANSAC-based outlier rejection scheme, which yields robust frame-to-frame motion estimation even in dynamic environments. A high-accuracy inertial navigation system is used to evaluate our results on challenging real-world video sequences. Experiments show that our approach is clearly superior to other filtering techniques in terms of both accuracy and run-time.

456 citations
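The outlier-rejection component of such pipelines follows the generic RANSAC pattern: repeatedly fit a model to a minimal random sample and keep the hypothesis with the largest consensus set. The sketch below fits a 2-D line purely as an illustration; the paper scores stereo correspondences through the trifocal geometry instead, and the threshold and iteration count here are arbitrary.

```python
import numpy as np

def ransac_line(points, n_iter=500, tol=0.05, seed=0):
    """Minimal RANSAC loop for a 2-D line; returns the best model and inlier indices."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = np.array([], dtype=int), None
    for _ in range(n_iter):
        i, j = rng.choice(len(points), size=2, replace=False)
        p, q = points[i], points[j]
        d = q - p
        n = np.array([-d[1], d[0]])
        n = n / (np.linalg.norm(n) + 1e-12)       # unit normal of the candidate line
        dist = np.abs((points - p) @ n)           # point-to-line distances
        inliers = np.where(dist < tol)[0]
        if len(inliers) > len(best_inliers):
            best_inliers, best_model = inliers, (p, n)
    return best_model, best_inliers               # refit on the inliers afterwards
```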


Journal ArticleDOI
TL;DR: This work proposes new tools for visualizing large amounts of functional data in the form of smooth curves, including functional versions of the bagplot and boxplot, which make use of the first two robust principal component scores, Tukey's data depth and highest density regions.
Abstract: We propose new tools for visualizing large amounts of functional data in the form of smooth curves. The proposed tools include functional versions of the bagplot and boxplot, which make use of the first two robust principal component scores, Tukey’s data depth and highest density regions. By-products of our graphical displays are outlier detection methods for functional data. We compare these new outlier detection methods with existing methods for detecting outliers in functional data, and show that our methods are better able to identify outliers. An R-package containing computer code and datasets is available in the online supplements.

303 citations
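A rough feel for score-based flagging of functional outliers can be had from the sketch below: project the curves onto their first two principal component scores and flag curves whose robust distance in score space is extreme. This is only a simplified stand-in; the paper's bagplot and boxplot displays use robust principal components, Tukey's data depth and highest density regions.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA
from sklearn.covariance import MinCovDet

def functional_score_outliers(curves, alpha=0.99):
    """curves: (n_curves, n_gridpoints) matrix of discretised functional data."""
    scores = PCA(n_components=2).fit_transform(curves)    # first two PC scores
    mcd = MinCovDet(random_state=0).fit(scores)           # robust centre and scatter
    d2 = mcd.mahalanobis(scores)                          # squared robust distances
    return np.where(d2 > chi2.ppf(alpha, df=2))[0]        # indices of flagged curves
```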


Posted Content
TL;DR: In this article, the authors consider general, heterogeneous, and arbitrarily covariant two-dimensional uncertainties, and situations in which there are bad data (large outliers), unknown uncertainties and unknown but expected intrinsic scatter in the linear relationship being fit, and emphasize the importance of having a generative model for the data.
Abstract: We go through the many considerations involved in fitting a model to data, using as an example the fit of a straight line to a set of points in a two-dimensional plane. Standard weighted least-squares fitting is only appropriate when there is a dimension along which the data points have negligible uncertainties, and another along which all the uncertainties can be described by Gaussians of known variance; these conditions are rarely met in practice. We consider cases of general, heterogeneous, and arbitrarily covariant two-dimensional uncertainties, and situations in which there are bad data (large outliers), unknown uncertainties, and unknown but expected intrinsic scatter in the linear relationship being fit. Above all we emphasize the importance of having a "generative model" for the data, even an approximate one. Once there is a generative model, the subsequent fitting is non-arbitrary because the model permits direct computation of the likelihood of the parameters or the posterior probability distribution. Construction of a posterior probability distribution is indispensable if there are "nuisance parameters" to marginalize away.

278 citations
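The generative-model recipe for handling bad data can be written down directly: each point is drawn either from the line or from a broad background (outlier) distribution, and the mixture likelihood is maximized. The sketch below follows that spirit with illustrative parameter names and a crude optimizer; it is not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_line_with_outliers(x, y, sigma_y):
    """Maximum-likelihood straight-line fit with an explicit outlier component.
    sigma_y: known Gaussian uncertainties of y."""
    def nll(theta):
        m, b, logit_p, mu_bg, log_s_bg = theta
        p_out = 1.0 / (1.0 + np.exp(-logit_p))            # outlier fraction in (0, 1)
        s_bg = np.exp(log_s_bg)                           # background spread
        ll_in = norm.logpdf(y, loc=m * x + b, scale=sigma_y)
        ll_out = norm.logpdf(y, loc=mu_bg, scale=np.sqrt(s_bg ** 2 + sigma_y ** 2))
        ll = np.logaddexp(np.log1p(-p_out) + ll_in, np.log(p_out) + ll_out)
        return -ll.sum()
    theta0 = np.array([1.0, 0.0, -2.0, float(np.mean(y)), float(np.log(np.std(y) + 1e-9))])
    res = minimize(nll, theta0, method="Nelder-Mead")
    return res.x   # slope, intercept, logit outlier fraction, background mean / log-sd
```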


Journal ArticleDOI
TL;DR: In this article, a robust rank correlation screening (RRCS) method is proposed to deal with ultra-high dimensional data, based on the Kendall τ correlation coefficient between response and predictor variables rather than the Pearson correlation.
Abstract: Independence screening is a variable selection method that uses a ranking criterion to select significant variables, particularly for statistical models with nonpolynomial dimensionality or "large p, small n" paradigms when p can be as large as an exponential of the sample size n. In this paper we propose a robust rank correlation screening (RRCS) method to deal with ultra-high dimensional data. The new procedure is based on the Kendall τ correlation coefficient between response and predictor variables rather than the Pearson correlation of existing methods. The new method has four desirable features compared with existing independence screening methods. First, the sure independence screening property can hold under only the existence of a second-order moment of the predictor variables, rather than exponential tails or the like, even when the number of predictor variables grows exponentially with the sample size. Second, it can be used to deal with semiparametric models such as transformation regression models and single-index models under a monotonicity constraint on the link function, without involving nonparametric estimation even when there are nonparametric functions in the models. Third, the procedure is robust against outliers and influential points in the observations. Last, the use of indicator functions in rank correlation screening greatly simplifies the theoretical derivation due to the boundedness of the resulting statistics, compared with previous studies on variable screening. Simulations are carried out for comparisons with existing methods and a real data example is analyzed.

265 citations
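The screening step itself is easy to emulate: rank predictors by the absolute Kendall τ correlation with the response and retain the top few. The sketch below shows only this ranking; the paper's theoretical thresholds and the semiparametric extensions are not reproduced, and the default screening size is just a common convention.

```python
import numpy as np
from scipy.stats import kendalltau

def rank_correlation_screening(X, y, d=None):
    """Keep the d predictors with the largest |Kendall tau| correlation with y."""
    n, p = X.shape
    d = d if d is not None else int(n / np.log(n))          # conventional default size
    taus = np.array([kendalltau(X[:, j], y)[0] for j in range(p)])
    keep = np.argsort(-np.abs(taus))[:d]                    # indices of retained predictors
    return keep, taus
```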


Proceedings ArticleDOI
25 Jul 2010
TL;DR: This paper proposes an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers, and applies the model to both synthetic data and DBLP data sets to demonstrate the importance of this concept as well as the effectiveness and efficiency of the proposed approach.
Abstract: Linked or networked data are ubiquitous in many applications. Examples include web data or hypertext documents connected via hyperlinks, social networks or user profiles connected via friend links, co-authorship and citation information, blog data, movie reviews and so on. In these datasets (called "information networks"), closely related objects that share the same properties or interests form a community. For example, a community in the blogosphere could be users mostly interested in cell phone reviews and news. Outlier detection in information networks can reveal important anomalous and interesting behaviors that are not obvious if community information is ignored. An example could be a low-income person being friends with many rich people even though his income is not anomalously low when considered over the entire population. This paper first introduces the concept of community outliers (interesting points or rising stars in a more positive sense), and then shows that well-known baseline approaches that ignore links or community information cannot find these community outliers. We propose an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers. The probabilistic model characterizes both data and links simultaneously by defining their joint distribution based on hidden Markov random fields (HMRF). Maximizing the data likelihood and the posterior of the model gives the solution to the outlier inference problem. We apply the model to both synthetic data and DBLP data sets, and the results demonstrate the importance of this concept, as well as the effectiveness and efficiency of the proposed approach.

Journal ArticleDOI
TL;DR: The PHoto-z Accuracy Testing programme (PHAT) is an international initiative to test and compare different methods of photo-z estimation; the test data sets are publicly available and can be used to compare new, upcoming methods to established ones and help in guiding future photo-z method development.
Abstract: Context. Photometric redshifts (photo-z's) have become an essential tool in extragalactic astronomy. Many current and upcoming observing programmes require great accuracy of photo-z's to reach their scientific goals. Aims. Here we introduce PHAT, the PHoto-z Accuracy Testing programme, an international initiative to test and compare different methods of photo-z estimation. Methods. Two different test environments are set up, one (PHAT0) based on simulations to test the basic functionality of the different photo-z codes, and another one (PHAT1) based on data from the GOODS survey including 18-band photometry and ~2000 spectroscopic redshifts. Results. The accuracy of the different methods is expressed and ranked by the global photo-z bias, scatter, and outlier rates. While most methods agree very well on PHAT0 there are differences in the handling of the Lyman-alpha forest for higher redshifts. Furthermore, different methods produce photo-z scatters that can differ by up to a factor of two even in this idealised case. A larger spread in accuracy is found for PHAT1. Few methods benefit from the addition of mid-IR photometry. The accuracy of the other methods is unaffected or suffers when IRAC data are included. Remaining biases and systematic effects can be explained by shortcomings in the different template sets (especially in the mid-IR) and the use of priors on the one hand and an insufficient training set on the other hand. Some strategies to overcome these problems are identified by comparing the methods in detail. Scatters of 4-8% in Δz/(1 + z) were obtained, consistent with other studies. However, somewhat larger outlier rates (> 7.5% with Δz/(1 + z) > 0.15; > 4.5% after cleaning) are found for all codes that can only partly be explained by AGN or issues in the photometry or the spec-z catalogue. Some outliers were probably missed in comparisons of photo-z's to other, less complete spectroscopic surveys in the past. There is a general trend that empirical codes produce smaller biases than template-based codes. Conclusions. The systematic, quantitative comparison of different photo-z codes presented here is a snapshot of the current state-of-the-art of photo-z estimation and sets a standard for the assessment of photo-z accuracy in the future. The rather large outlier rates reported here for PHAT1 on real data should be investigated further since they are most probably also present (and possibly hidden) in many other studies. The test data sets are publicly available and can be used to compare new, upcoming methods to established ones and help in guiding future photo-z method development.
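The ranking statistics used in PHAT are straightforward to compute from matched photometric and spectroscopic redshifts; the sketch below evaluates the bias, scatter and outlier rate with the |Δz|/(1 + z) > 0.15 outlier definition quoted in the abstract (the exact estimators, e.g. clipped statistics, may differ in the published analysis).

```python
import numpy as np

def photoz_quality(z_phot, z_spec, outlier_cut=0.15):
    """Bias, scatter and outlier rate of dz = (z_phot - z_spec) / (1 + z_spec)."""
    z_phot, z_spec = np.asarray(z_phot, float), np.asarray(z_spec, float)
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    bias = dz.mean()
    scatter = dz.std()
    outlier_rate = np.mean(np.abs(dz) > outlier_cut)
    return bias, scatter, outlier_rate
```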

Journal ArticleDOI
TL;DR: This paper presents a robust mixture modeling framework using the multivariate skew t distributions, an extension of the multivariate Student's t family with additional shape parameters to regulate skewness, which results in a very complicated likelihood.
Abstract: This paper presents a robust mixture modeling framework using the multivariate skew t distributions, an extension of the multivariate Student's t family with additional shape parameters to regulate skewness. The proposed model results in a very complicated likelihood. Two variants of Monte Carlo EM algorithms are developed to carry out maximum likelihood estimation of mixture parameters. In addition, we offer a general information-based method for obtaining the asymptotic covariance matrix of maximum likelihood estimates. Some practical issues including the selection of starting values as well as the stopping criterion are also discussed. The proposed methodology is applied to a subset of the Australian Institute of Sport data for illustration.

Posted Content
TL;DR: A thresholding based iterative procedure for outlier detection (Θ-IPOD) based on hard thresholding correctly identifies outliers on some hard test problems and is much faster than iteratively reweighted least squares for large data, because each iteration costs at most O(np) (and sometimes much less), avoiding an O(np^2) least squares estimate.
Abstract: This paper studies the outlier detection problem from the point of view of penalized regressions. Our regression model adds one mean shift parameter for each of the $n$ data points. We then apply a regularization favoring a sparse vector of mean shift parameters. The usual $L_1$ penalty yields a convex criterion, but we find that it fails to deliver a robust estimator. The $L_1$ penalty corresponds to soft thresholding. We introduce a thresholding (denoted by $\Theta$) based iterative procedure for outlier detection ($\Theta$-IPOD). A version based on hard thresholding correctly identifies outliers on some hard test problems. We find that $\Theta$-IPOD is much faster than iteratively reweighted least squares for large data because each iteration costs at most $O(np)$ (and sometimes much less) avoiding an $O(np^2)$ least squares estimate. We describe the connection between $\Theta$-IPOD and $M$-estimators. Our proposed method has one tuning parameter with which to both identify outliers and estimate regression coefficients. A data-dependent choice can be made based on BIC. The tuned $\Theta$-IPOD shows outstanding performance in identifying outliers in various situations in comparison to other existing approaches. This methodology extends to high-dimensional modeling with $p\gg n$, if both the coefficient vector and the outlier pattern are sparse.
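The alternating structure of the mean-shift model is simple enough to sketch: regress on the shifted response, then hard-threshold the residuals to update the per-observation shifts. The snippet below is an illustrative Θ-IPOD-style iteration with a fixed threshold; the BIC-based tuning and the high-dimensional extension described in the abstract are omitted.

```python
import numpy as np

def ipod_hard_threshold(X, y, lam, n_iter=100):
    """y = X @ beta + gamma + noise, with gamma sparse; nonzero gamma marks outliers."""
    n, p = X.shape
    XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)          # reused every iteration
    gamma = np.zeros(n)
    for _ in range(n_iter):
        beta = XtX_inv_Xt @ (y - gamma)                 # least squares on shifted response
        r = y - X @ beta                                # current residuals
        gamma = np.where(np.abs(r) > lam, r, 0.0)       # hard-thresholding step
    return beta, gamma, np.where(gamma != 0)[0]
```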

Journal ArticleDOI
TL;DR: The results show that the proposed methods outperform standard imputation methods in the presence of outliers, and the model-based method with robust regressions is preferable.

Journal ArticleDOI
TL;DR: In this article, a companion robust R2 estimator is proposed, which is robust to deviations from the specified regression model (like the presence of outliers), is efficient if the errors are normally distributed, and does not make any assumption on the distribution of the explanatory variables (and therefore no assumption on the unconditional distribution of the responses).

Journal ArticleDOI
TL;DR: A two-phase approach to restore images corrupted by blur and impulse noise: the first phase identifies the outlier candidates (the pixels that are likely to be corrupted by impulse noise) and the second phase deblurs and denoises the image with a variational method using the essentially outlier-free data.
Abstract: In this paper, we propose a two-phase approach to restore images corrupted by blur and impulse noise. In the first phase, we identify the outlier candidates--the pixels that are likely to be corrupted by impulse noise. We consider that the remaining data pixels are essentially free of outliers. Then in the second phase, the image is deblurred and denoised simultaneously by a variational method by using the essentially outlier-free data. The experiments show several dB's improvement in PSNR with respect to the typical variational methods.
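The first (detection) phase can be approximated very simply: mark as outlier candidates the pixels that deviate strongly from a local median, and exclude them from the data-fidelity term of the second-phase variational deblurring. The detector below is a crude stand-in with an arbitrary tolerance; the paper relies on more refined impulse-noise detectors such as adaptive median filtering.

```python
import numpy as np
from scipy.ndimage import median_filter

def impulse_candidates(img, window=3, tol=30.0):
    """Return a boolean mask of pixels suspected to be corrupted by impulse noise."""
    img = np.asarray(img, float)
    med = median_filter(img, size=window)        # local median in a window x window patch
    return np.abs(img - med) > tol               # True = candidate outlier pixel
```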

Posted Content
TL;DR: In this paper, a companion robust R2 estimator is proposed, which is robust to deviations from the specified regression model (like the presence of outliers), it is efficient if the errors are normally distributed, and it does not make any assumption on the distribution of the explanatory variables (and therefore no assumption on an unconditional distribution of responses).
Abstract: To assess the quality of the fit in a multiple linear regression, the coefficient of determination or R2 is a very simple tool, yet the most used by practitioners. Indeed, it is reported in most statistical analyses, and although it is not recommended as a final model selection tool, it provides an indication of the suitability of the chosen explanatory variables in predicting the response. In the classical setting, it is well known that the least-squares fit and coefficient of determination can be arbitrary and/or misleading in the presence of a single outlier. In many applied settings, the assumption of normality of the errors and the absence of outliers are difficult to establish. In these cases, robust procedures for estimation and inference in linear regression are available and provide a suitable alternative. In this paper we present a companion robust coefficient of determination that has several desirable properties not shared by others. It is robust to deviations from the specified regression model (like the presence of outliers), it is efficient if the errors are normally distributed, and it does not make any assumption on the distribution of the explanatory variables (and therefore no assumption on the unconditional distribution of the responses). We also show that it is a consistent estimator of the population coefficient of determination. A simulation study and two real datasets support the appropriateness of this estimator, compared with classical (least-squares) and several previously proposed robust R2 estimators, even for small sample sizes.

Journal ArticleDOI
TL;DR: Experimental results demonstrate that this regression scheme achieves higher estimation accuracy and better generalizing ability than Support Vector Regression (SVR).

Journal ArticleDOI
TL;DR: This work compared classical exploratory factor analysis with a robust counterpart which is less influenced by data outliers and data heterogeneities, and revealed that robust exploratory factor analysis is more stable than the classical method.

Posted Content
TL;DR: This work presents an efficient convex optimization-based algorithm that it calls outlier pursuit, which under some mild assumptions on the uncorrupted points recovers the exact optimal low-dimensional subspace and identifies the corrupted points.
Abstract: Singular Value Decomposition (and Principal Component Analysis) is one of the most widely used techniques for dimensionality reduction: successful and efficiently computable, it is nevertheless plagued by a well-known, well-documented sensitivity to outliers. Recent work has considered the setting where each point has a few arbitrarily corrupted components. Yet, in applications of SVD or PCA such as robust collaborative filtering or bioinformatics, malicious agents, defective genes, or simply corrupted or contaminated experiments may effectively yield entire points that are completely corrupted. We present an efficient convex optimization-based algorithm we call Outlier Pursuit, which under some mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the exact optimal low-dimensional subspace and identifies the corrupted points. Such identification of corrupted points that do not conform to the low-dimensional approximation is of paramount interest in bioinformatics and financial applications, and beyond. Our techniques involve matrix decomposition using nuclear norm minimization; however, our results, setup, and approach necessarily differ considerably from the existing line of work in matrix completion and matrix decomposition, since we develop an approach to recover the correct column space of the uncorrupted matrix, rather than the exact matrix itself. In any problem where one seeks to recover a structure rather than the exact initial matrices, techniques developed thus far, relying on certificates of optimality, will fail. We present an important extension of these methods that allows the treatment of such problems.

Journal ArticleDOI
13 Dec 2010
TL;DR: A new method for estimating SWS that considers a solution space of trajectories and evaluates each trajectory using a metric characterizing wave motion along the entire trajectory; the method is suitable for use in situations requiring real-time feedback and is comparably robust to the RANSAC algorithm with respect to outlier data.
Abstract: Time-of-flight methods allow quantitative measurement of shear wave speed (SWS) from ultrasonically tracked displacements following impulsive excitation in tissue. However, application of these methods to in vivo data is challenging because of the presence of gross outlier data resulting from sources such as physiological motion or spatial inhomogeneities. This paper describes a new method for estimating SWS by considering a solution space of trajectories and evaluating each trajectory using a metric that characterizes wave motion along the entire trajectory. The metric used here is found by summing displacement data along the trajectory, as in the calculation of projection data in the Radon transformation. The algorithm is evaluated using data acquired in calibrated phantoms and in vivo human liver. Results are compared with SWS estimates using a random sample consensus (RANSAC) algorithm described by Wang et al. Good agreement is found between the Radon sum and RANSAC SWS estimates, with a correlation coefficient of greater than 0.99 for phantom data and 0.91 for in vivo liver data. The Radon sum transformation is suitable for use in situations requiring real-time feedback and is comparably robust to the RANSAC algorithm with respect to outlier data.
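The Radon-sum idea amounts to scoring straight space-time trajectories by the displacement summed along them and keeping the best one. The grid search below is a bare-bones illustration (nearest-sample lookup, exhaustive candidate speeds and departure times); the published method is more careful about trajectory parameterization and sampling.

```python
import numpy as np

def radon_sum_sws(disp, x, t, speeds):
    """disp: (n_positions, n_times) tracked displacement; x: lateral positions (m);
    t: time samples (s); speeds: candidate shear wave speeds (m/s)."""
    best_speed, best_score = None, -np.inf
    for c in speeds:
        for t0 in t:                               # candidate departure times
            tt = t0 + (x - x[0]) / c               # arrival time at each position
            idx = np.clip(np.searchsorted(t, tt), 0, len(t) - 1)   # nearest following sample
            score = disp[np.arange(len(x)), idx].sum()             # sum along trajectory
            if score > best_score:
                best_speed, best_score = c, score
    return best_speed
```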

Journal ArticleDOI
TL;DR: A novel technique for estimating normals on unorganized point clouds that is capable of dealing with points located in high-curvature regions or near/on complex sharp features, while being highly robust to noise and outliers.

Journal ArticleDOI
TL;DR: Compared to conventional mean regression, quantile regression can characterize the entire conditional distribution of the outcome variable, and is more robust to outliers and misspecification of the error distribution.
Abstract: We study quantile regression (QR) for longitudinal measurements with nonignorable intermittent missing data and dropout. Compared to conventional mean regression, quantile regression can characterize the entire conditional distribution of the outcome variable, and is more robust to outliers and misspecification of the error distribution. We account for the within-subject correlation by introducing an ℓ2 penalty in the usual QR check function to shrink the subject-specific intercepts and slopes toward the common population values. The informative missing data are assumed to be related to the longitudinal outcome process through the shared latent random effects. We assess the performance of the proposed method using simulation studies, and illustrate it with data from a pediatric AIDS clinical trial.
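The shrinkage idea in the check-function formulation can be sketched directly as an objective: the quantile check loss plus an ℓ2 penalty on subject-specific effects. The toy version below penalizes only subject intercepts and uses a generic optimizer; the paper's estimator also shrinks slopes and handles the informative missingness through shared random effects.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_qr(X, y, subject, tau=0.5, lam=1.0):
    """Quantile regression with l2-shrunk subject-specific intercepts (toy version)."""
    subjects = np.unique(subject)                         # sorted unique subject ids
    idx = np.searchsorted(subjects, subject)              # map each row to its subject
    p = X.shape[1]

    def loss(theta):
        beta, b = theta[:p], theta[p:]
        u = y - (X @ beta + b[idx])
        check = np.sum(u * (tau - (u < 0)))               # quantile check function
        return check + lam * np.sum(b ** 2)               # shrink subject intercepts

    theta0 = np.zeros(p + len(subjects))
    res = minimize(loss, theta0, method="Powell")         # derivative-free; loss is nonsmooth
    return res.x[:p], res.x[p:]
```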

Journal ArticleDOI
01 Jun 2010 - Genetics
TL;DR: A general method for applying ABC to Bayesian hierarchical models is developed and applied to detect microsatellite loci influenced by local selection, and it is demonstrated using receiver operating characteristic (ROC) analysis that this approach has comparable performance to a full-likelihood method and outperforms it when mutation rates are variable across loci.
Abstract: We address the problem of finding evidence of natural selection from genetic data, accounting for the confounding effects of demographic history. In the absence of natural selection, gene genealogies should all be sampled from the same underlying distribution, often approximated by a coalescent model. Selection at a particular locus will lead to a modified genealogy, and this motivates a number of recent approaches for detecting the effects of natural selection in the genome as “outliers” under some models. The demographic history of a population affects the sampling distribution of genealogies, and therefore the observed genotypes and the classification of outliers. Since we cannot see genealogies directly, we have to infer them from the observed data under some model of mutation and demography. Thus the accuracy of an outlier-based approach depends to a greater or a lesser extent on the uncertainty about the demographic and mutational model. A natural modeling framework for this type of problem is provided by Bayesian hierarchical models, in which parameters, such as mutation rates and selection coefficients, are allowed to vary across loci. It has proved quite difficult computationally to implement fully probabilistic genealogical models with complex demographies, and this has motivated the development of approximations such as approximate Bayesian computation (ABC). In ABC the data are compressed into summary statistics, and computation of the likelihood function is replaced by simulation of data under the model. In a hierarchical setting one may be interested both in hyperparameters and parameters, and there may be very many of the latter—for example, in a genetic model, these may be parameters describing each of many loci or populations. This poses a problem for ABC in that one then requires summary statistics for each locus, which, if used naively, leads to a consequent difficulty in conditional density estimation. We develop a general method for applying ABC to Bayesian hierarchical models, and we apply it to detect microsatellite loci influenced by local selection. We demonstrate using receiver operating characteristic (ROC) analysis that this approach has comparable performance to a full-likelihood method and outperforms it when mutation rates are variable across loci.
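The ABC ingredient the paper builds on is easy to show in its plain rejection form: draw parameters from the prior, simulate data, and keep the draws whose summary statistics land closest to the observed ones. The generic sampler below takes user-supplied prior and simulator callables; the hierarchical extension and the locus-wise conditional density estimation developed in the paper are not shown.

```python
import numpy as np

def abc_rejection(observed_stats, simulate, prior_sample, n_sims=10000, keep_frac=0.01):
    """Plain ABC rejection: keep the closest keep_frac of simulated parameter draws."""
    observed_stats = np.asarray(observed_stats, float)
    draws, dists = [], []
    for _ in range(n_sims):
        theta = prior_sample()                             # draw parameters from the prior
        s = np.asarray(simulate(theta), float)             # simulate summary statistics
        draws.append(theta)
        dists.append(np.linalg.norm(s - observed_stats))
    dists = np.asarray(dists)
    eps = np.quantile(dists, keep_frac)                    # acceptance tolerance
    return [d for d, dist in zip(draws, dists) if dist <= eps]
```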

Journal ArticleDOI
TL;DR: In this article, the authors developed multivariate outlier tests based on the high-breakdown Minimum Covariance Determinant estimator, which has good performance under the null hypothesis of no outliers in the data and also appreciable power properties for the purpose of individual outlier detection.
Abstract: In this paper we develop multivariate outlier tests based on the high-breakdown Minimum Covariance Determinant estimator. The rules that we propose have good performance under the null hypothesis of no outliers in the data and also appreciable power properties for the purpose of individual outlier detection. This achievement is made possible by two orders of improvement over the currently available methodology. First, we suggest an approximation to the exact distribution of robust distances from which cut-off values can be obtained even in small samples. Our thresholds are accurate, simple to implement and result in more powerful outlier identification rules than those obtained by calibrating the asymptotic distribution of distances. The second power improvement comes from the addition of a new iteration step after one-step reweighting of the estimator. The proposed methodology is motivated by asymptotic distributional results. Its finite sample performance is evaluated through simulations and compared to...
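A baseline version of an MCD-based rule is shown below: compute robust distances from the Minimum Covariance Determinant fit and compare them with an asymptotic chi-square cutoff. The paper's contributions, a small-sample approximation of the distance distribution and an additional iteration after one-step reweighting, are precisely what this naive sketch lacks.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def mcd_outlier_flags(X, alpha=0.025):
    """Flag rows of X whose squared robust distance exceeds the chi2_{p, 1-alpha} cutoff."""
    X = np.asarray(X, float)
    mcd = MinCovDet(random_state=0).fit(X)       # high-breakdown location/scatter estimate
    d2 = mcd.mahalanobis(X)                      # squared robust distances
    return d2 > chi2.ppf(1.0 - alpha, df=X.shape[1])
```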

Journal ArticleDOI
TL;DR: Experimental results of active learning for word sense disambiguation and text classification tasks using six real-world evaluation data sets demonstrate the effectiveness of the proposed methods, sampling by uncertainty and density (SUD) and density-based re-ranking.
Abstract: To solve the knowledge bottleneck problem, active learning has been widely used for its ability to automatically select the most informative unlabeled examples for human annotation. One of the key enabling techniques of active learning is uncertainty sampling, which uses one classifier to identify unlabeled examples with the least confidence. Uncertainty sampling often presents problems when outliers are selected. To solve the outlier problem, this paper presents two techniques, sampling by uncertainty and density (SUD) and density-based re-ranking. Both techniques prefer not only the most informative example in terms of uncertainty criterion, but also the most representative example in terms of density criterion. Experimental results of active learning for word sense disambiguation and text classification tasks using six real-world evaluation data sets demonstrate the effectiveness of the proposed methods.
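The intuition behind sampling by uncertainty and density can be captured in a few lines: weight an uncertainty score (here, prediction entropy) by a density score (here, average similarity to the nearest neighbours), so that uncertain but isolated examples, likely outliers, are not selected. The exact combination and similarity measure in the paper may differ from this illustrative product.

```python
import numpy as np

def sud_scores(probs, X, k=20):
    """probs: (n, n_classes) predicted probabilities for the unlabeled pool;
    X: (n, d) feature vectors. Returns one selection score per example."""
    # uncertainty: entropy of the classifier's predictive distribution
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # density: mean cosine similarity to the k most similar pool examples
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)               # ignore self-similarity
    density = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    return entropy * density                     # query the highest-scoring examples
```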

Journal ArticleDOI
TL;DR: Application of phenix.model_vs_data to the contents of the Protein Data Bank shows that the vast majority of deposited structures can be automatically analyzed to reproduce the reported quality statistics, but the small fraction that eludes automated re-analysis highlights areas where new software developments can help retain valuable information for future analysis.
Abstract: phenix.model_vs_data is a high-level command-line tool for the computation of crystallographic model and data statistics, and the evaluation of the fit of the model to data. Analysis of all Protein Data Bank structures that have experimental data available shows that in most cases the reported statistics, in particular R factors, can be reproduced within a few percentage points. However, there are a number of outliers where the recomputed R values are significantly different from those originally reported. The reasons for these discrepancies are discussed.

Journal ArticleDOI
TL;DR: An outlier detection procedure that applies a nonparametric model accounting for undesired outputs and exogenous influences in the sample, exploiting the leverage, the peer count, the super-efficiency and order-m methods, and the peer index.
Abstract: This paper suggests an outlier detection procedure which applies a nonparametric model accounting for undesired outputs and exogenous influences in the sample. Although efficiency is estimated in a deterministic frontier approach, each potential outlier is initially given the benefit of the doubt of not being an outlier. We survey several outlier detection procedures and select five complementary methodologies which, taken together, are able to detect all influential observations. To exploit the singularity of the leverage and the peer count, the super-efficiency and the order-m method and the peer index, it is proposed to select as outliers those observations which are simultaneously revealed as atypical by at least two of the procedures. A simulated example demonstrates the usefulness of this approach. The model is applied to the Portuguese drinking water sector, for which we have an unusually rich data set.

Book ChapterDOI
01 Apr 2010
TL;DR: This paper proposes a unified framework for combining different outlier detection algorithms that is very effective in detecting outliers in the real-world context compared to other ensemble and individual approaches.
Abstract: Outlier detection has many practical applications, especially in domains that have scope for abnormal behavior. Despite the importance of detecting outliers, defining outliers is in fact a nontrivial task that is normally application-dependent. On the other hand, detection techniques are constructed around the chosen definitions. As a consequence, available detection techniques vary significantly in terms of accuracy, performance and the issues of the detection problem which they address. In this paper, we propose a unified framework for combining different outlier detection algorithms. Unlike existing work, our approach combines non-compatible techniques of different types to improve the outlier detection accuracy compared to other ensemble and individual approaches. Through extensive empirical studies, our framework is shown to be very effective in detecting outliers in the real-world context.
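One concrete (and deliberately simple) way to fuse heterogeneous detectors is rank-based voting: convert each detector's scores to ranks and flag the points placed near the top by several detectors at once. This is only an illustration of the combination idea; the unified framework proposed in the paper is more elaborate.

```python
import numpy as np
from scipy.stats import rankdata

def combine_outlier_detectors(score_lists, min_votes=2, top_fraction=0.05):
    """score_lists: list of 1-D arrays, one outlier score per point from each detector
    (higher = more outlying). Returns indices flagged by at least min_votes detectors."""
    n = len(score_lists[0])
    votes = np.zeros(n, dtype=int)
    for scores in score_lists:
        ranks = rankdata(-np.asarray(scores, float))         # rank 1 = most outlying
        votes += (ranks <= top_fraction * n).astype(int)
    return np.where(votes >= min_votes)[0]
```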