
Showing papers on "Outlier published in 2010"


Journal ArticleDOI
TL;DR: A probabilistic method, the Coherent Point Drift (CPD) algorithm, is introduced for both rigid and nonrigid point set registration, together with a fast variant that reduces the method's computational complexity to linear.
Abstract: Point set registration is a key component in many computer vision tasks. The goal of point set registration is to assign correspondences between two sets of points and to recover the transformation that maps one point set to the other. Multiple factors, including an unknown nonrigid spatial transformation, large dimensionality of the point sets, noise, and outliers, make point set registration a challenging problem. We introduce a probabilistic method, called the Coherent Point Drift (CPD) algorithm, for both rigid and nonrigid point set registration. We consider the alignment of two point sets as a probability density estimation problem. We fit the Gaussian mixture model (GMM) centroids (representing the first point set) to the data (the second point set) by maximizing the likelihood. We force the GMM centroids to move coherently as a group to preserve the topological structure of the point sets. In the rigid case, we impose the coherence constraint by reparameterizing the GMM centroid locations with rigid parameters and derive a closed-form solution of the maximization step of the EM algorithm in arbitrary dimensions. In the nonrigid case, we impose the coherence constraint by regularizing the displacement field and using variational calculus to derive the optimal transformation. We also introduce a fast algorithm that reduces the method's computational complexity to linear. We test the CPD algorithm for both rigid and nonrigid transformations in the presence of noise, outliers, and missing points, where CPD shows accurate results and outperforms current state-of-the-art methods.

2,429 citations
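To make the EM structure of this kind of registration concrete, here is a minimal Python sketch of rigid GMM-based alignment with a uniform outlier component, written in the spirit of CPD. It is a simplified illustration only (no scaling, no fast Gauss transform, and a plain variance update); the function name and defaults are our own, not the published implementation.

```python
import numpy as np

def rigid_gmm_register(X, Y, n_iter=50, w=0.1):
    """Toy EM registration in the spirit of rigid CPD (no scaling).
    X: (N, D) fixed points; Y: (M, D) moving GMM centroids.
    Illustrative simplification, not the published CPD algorithm."""
    N, D = X.shape
    M = Y.shape[0]
    R, t = np.eye(D), np.zeros(D)
    sigma2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum() / (D * M * N)
    for _ in range(n_iter):
        TY = Y @ R.T + t
        # E-step: responsibilities, with a uniform component of weight w absorbing outliers
        d2 = ((X[None, :, :] - TY[:, None, :]) ** 2).sum(-1)            # (M, N)
        num = np.exp(-d2 / (2.0 * sigma2))
        c = (2.0 * np.pi * sigma2) ** (D / 2.0) * (w / (1.0 - w)) * M / N
        P = num / (num.sum(axis=0, keepdims=True) + c)                   # (M, N)
        Np = P.sum()
        # M-step: weighted Procrustes gives the rigid parameters in closed form
        mu_x = (P.sum(axis=0) @ X) / Np
        mu_y = (P.sum(axis=1) @ Y) / Np
        Xh, Yh = X - mu_x, Y - mu_y
        A = Xh.T @ P.T @ Yh
        U, _, Vt = np.linalg.svd(A)
        C = np.eye(D)
        C[-1, -1] = np.linalg.det(U @ Vt)            # ensure a proper rotation
        R = U @ C @ Vt
        t = mu_x - R @ mu_y
        # update the isotropic variance using the new transform
        d2 = ((X[None, :, :] - (Y @ R.T + t)[:, None, :]) ** 2).sum(-1)
        sigma2 = max((P * d2).sum() / (Np * D), 1e-8)
    return R, t, Y @ R.T + t
```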


Journal ArticleDOI
TL;DR: Standard diagnostic procedures developed for linear regression analyses are extended to the meta-analytic fixed- and random/mixed-effects models, and three examples illustrate the usefulness of these procedures in various research settings.
Abstract: The presence of outliers and influential cases may affect the validity and robustness of the conclusions from a meta-analysis. While researchers generally agree that it is necessary to examine outlier and influential case diagnostics when conducting a meta-analysis, limited studies have addressed how to obtain such diagnostic measures in the context of a meta-analysis. The present paper extends standard diagnostic procedures developed for linear regression analyses to the meta-analytic fixed- and random/mixed-effects models. Three examples are used to illustrate the usefulness of these procedures in various research settings. Issues related to these diagnostic procedures in meta-analysis are also discussed. Copyright © 2010 John Wiley & Sons, Ltd.

1,335 citations
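For readers who want a feel for what such case diagnostics compute, the sketch below evaluates leave-one-out (deleted) residuals and the influence of each study on the pooled estimate under a simple fixed-effect model with inverse-variance weights. It is a generic illustration with hypothetical names, not the authors' formulas, which also cover random/mixed-effects models.

```python
import numpy as np

def fixed_effect_diagnostics(y, v):
    """Leave-one-out case diagnostics for a fixed-effect meta-analysis.
    y: observed effect sizes; v: their sampling variances."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v
    mu = np.sum(w * y) / np.sum(w)                      # pooled estimate
    out = []
    for i in range(len(y)):
        keep = np.ones(len(y), bool)
        keep[i] = False
        mu_i = np.sum(w[keep] * y[keep]) / np.sum(w[keep])   # estimate without study i
        var_mu_i = 1.0 / np.sum(w[keep])
        deleted_resid = (y[i] - mu_i) / np.sqrt(v[i] + var_mu_i)
        influence = (mu - mu_i) / np.sqrt(var_mu_i)          # shift caused by study i
        out.append((deleted_resid, influence))
    return mu, np.array(out)

# Studies with |deleted residual| well above 2-3, or with a large influence value,
# are candidate outliers / influential cases.
```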


Journal ArticleDOI
TL;DR: This paper presents a method based on robust statistics to register images in the presence of differences, such as jaw movement, differential MR distortions and true anatomical change, which is highly accurate and shows superior robustness with respect to noise, to intensity scaling and outliers.

1,132 citations


Journal ArticleDOI
TL;DR: In this article, the authors provide a comprehensive overview of existing outlier detection techniques specifically developed for wireless sensor networks and present a technique-based taxonomy and a comparative table to be used as a guideline for selecting a technique suitable for the application at hand.
Abstract: In the field of wireless sensor networks, those measurements that significantly deviate from the normal pattern of sensed data are considered as outliers. The potential sources of outliers include noise and errors, events, and malicious attacks on the network. Traditional outlier detection techniques are not directly applicable to wireless sensor networks due to the nature of sensor data and the specific requirements and limitations of wireless sensor networks. This survey provides a comprehensive overview of existing outlier detection techniques specifically developed for wireless sensor networks. Additionally, it presents a technique-based taxonomy and a comparative table to be used as a guideline to select a technique suitable for the application at hand based on characteristics such as data type, outlier type, outlier identity, and outlier degree.

738 citations


Proceedings Article
06 Dec 2010
TL;DR: In this paper, an efficient convex optimization-based algorithm called Outlier Pursuit is presented, which under mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the exact optimal low-dimensional subspace, and identifies the corrupted points.
Abstract: Singular Value Decomposition (and Principal Component Analysis) is one of the most widely used techniques for dimensionality reduction: successful and efficiently computable, it is nevertheless plagued by a well-known, well-documented sensitivity to outliers. Recent work has considered the setting where each point has a few arbitrarily corrupted components. Yet, in applications of SVD or PCA such as robust collaborative filtering or bioinformatics, malicious agents, defective genes, or simply corrupted or contaminated experiments may effectively yield entire points that are completely corrupted. We present an efficient convex optimization-based algorithm we call Outlier Pursuit, which under some mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the exact optimal low-dimensional subspace and identifies the corrupted points. Such identification of corrupted points that do not conform to the low-dimensional approximation is of paramount interest in bioinformatics and financial applications, and beyond. Our techniques involve matrix decomposition using nuclear norm minimization; however, our results, setup, and approach necessarily differ considerably from the existing line of work in matrix completion and matrix decomposition, since we develop an approach to recover the correct column space of the uncorrupted matrix, rather than the exact matrix itself.

590 citations
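The decomposition idea can be illustrated with a small alternating-proximal sketch that splits a data matrix into a low-rank part plus a column-sparse part. Note that this solves a relaxed, unconstrained variant (chosen here for brevity), not the exact program analyzed in the paper, and the regularization weights are placeholders.

```python
import numpy as np

def outlier_pursuit_sketch(M, lam_nuc=1.0, lam_col=1.0, n_iter=200):
    """Alternate exact proximal updates for
        min_{L,C} 0.5*||M - L - C||_F^2 + lam_nuc*||L||_* + lam_col*sum_j ||C[:, j]||_2.
    Columns with nonzero C are flagged as corrupted points (outliers)."""
    L = np.zeros_like(M, dtype=float)
    C = np.zeros_like(M, dtype=float)
    for _ in range(n_iter):
        # L-update: singular value soft-thresholding of M - C
        U, s, Vt = np.linalg.svd(M - C, full_matrices=False)
        L = (U * np.maximum(s - lam_nuc, 0.0)) @ Vt
        # C-update: shrink whole columns of M - L toward zero (group soft-thresholding)
        R = M - L
        norms = np.linalg.norm(R, axis=0)
        scale = np.maximum(1.0 - lam_col / np.maximum(norms, 1e-12), 0.0)
        C = R * scale
    outlier_cols = np.where(np.linalg.norm(C, axis=0) > 1e-8)[0]
    return L, C, outlier_cols
```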


Journal ArticleDOI
TL;DR: In this paper, various techniques aimed at detecting potential outliers are reviewed and subdivided into two classes: those regarding univariate data and those addressing multivariate data.
Abstract: Outliers are observations or measures that are suspicious because they are much smaller or much larger than the vast majority of the observations. These observations are problematic because they may not be caused by the mental process under scrutiny or may not reflect the ability under examination. The problem is that a few outliers are sometimes enough to distort the group results (by altering the mean performance, by increasing variability, etc.). In this paper, various techniques aimed at detecting potential outliers are reviewed. These techniques are subdivided into two classes: those regarding univariate data and those addressing multivariate data. Within these two classes, we consider the cases where the population distribution is known to be normal, where it is non-normal but known, and where it is unknown. Recommendations are put forward in each case.

494 citations
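Two of the most common rules reviewed in this literature are easy to state in code: a robust z-score based on the median and MAD for univariate data, and Mahalanobis distances against a chi-square cutoff for multivariate data. The snippet below is a generic illustration of these rules, not a reproduction of the paper's specific recommendations.

```python
import numpy as np
from scipy import stats

def univariate_outliers(x, z_cut=3.0):
    """Robust z-score rule: |0.6745 * (x - median) / MAD| > z_cut."""
    x = np.asarray(x, float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    robust_z = 0.6745 * (x - med) / mad          # approx. N(0, 1) under normality
    return np.abs(robust_z) > z_cut

def multivariate_outliers(X, alpha=0.975):
    """Mahalanobis-distance rule against a chi-square quantile."""
    X = np.asarray(X, float)
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', X - mu, S_inv, X - mu)   # squared distances
    return d2 > stats.chi2.ppf(alpha, df=X.shape[1])
```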


Proceedings ArticleDOI
21 Jun 2010
TL;DR: This paper proposes a novel approach for estimating the egomotion of the vehicle from a sequence of stereo images which is directly based on the trifocal geometry between image triples, so no time-consuming recovery of the 3-dimensional scene structure is needed.
Abstract: A common prerequisite for many vision-based driver assistance systems is the knowledge of the vehicle's own movement. In this paper we propose a novel approach for estimating the egomotion of the vehicle from a sequence of stereo images. Our method is directly based on the trifocal geometry between image triples, so no time-consuming recovery of the 3-dimensional scene structure is needed. The only assumption we make is a known camera geometry, where the calibration may also vary over time. We employ an Iterated Sigma Point Kalman Filter in combination with a RANSAC-based outlier rejection scheme, which yields robust frame-to-frame motion estimation even in dynamic environments. A high-accuracy inertial navigation system is used to evaluate our results on challenging real-world video sequences. Experiments show that our approach is clearly superior to other filtering techniques in terms of both accuracy and run-time.

456 citations
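The outlier-rejection component of such pipelines follows the generic RANSAC pattern: repeatedly fit a model to a minimal random sample and keep the hypothesis with the largest consensus set. The sketch below fits a 2-D line purely as an illustration; the paper scores stereo correspondences through the trifocal geometry instead, and the threshold and iteration count here are arbitrary.

```python
import numpy as np

def ransac_line(points, n_iter=500, tol=0.05, seed=0):
    """Minimal RANSAC loop for a 2-D line; returns the best model and inlier indices."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = np.array([], dtype=int), None
    for _ in range(n_iter):
        i, j = rng.choice(len(points), size=2, replace=False)
        p, q = points[i], points[j]
        d = q - p
        n = np.array([-d[1], d[0]])
        n = n / (np.linalg.norm(n) + 1e-12)       # unit normal of the candidate line
        dist = np.abs((points - p) @ n)           # point-to-line distances
        inliers = np.where(dist < tol)[0]
        if len(inliers) > len(best_inliers):
            best_inliers, best_model = inliers, (p, n)
    return best_model, best_inliers               # refit on the inliers afterwards
```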


Journal ArticleDOI
TL;DR: This work proposes new tools for visualizing large amounts of functional data in the form of smooth curves, including functional versions of the bagplot and boxplot, which make use of the first two robust principal component scores, Tukey's data depth and highest density regions.
Abstract: We propose new tools for visualizing large amounts of functional data in the form of smooth curves. The proposed tools include functional versions of the bagplot and boxplot, which make use of the first two robust principal component scores, Tukey’s data depth and highest density regions. By-products of our graphical displays are outlier detection methods for functional data. We compare these new outlier detection methods with existing methods for detecting outliers in functional data, and show that our methods are better able to identify outliers. An R-package containing computer code and datasets is available in the online supplements.

303 citations
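A rough feel for score-based flagging of functional outliers can be had from the sketch below: project the curves onto their first two principal component scores and flag curves whose robust distance in score space is extreme. This is only a simplified stand-in; the paper's bagplot and boxplot displays use robust principal components, Tukey's data depth and highest density regions.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA
from sklearn.covariance import MinCovDet

def functional_score_outliers(curves, alpha=0.99):
    """curves: (n_curves, n_gridpoints) matrix of discretised functional data."""
    scores = PCA(n_components=2).fit_transform(curves)    # first two PC scores
    mcd = MinCovDet(random_state=0).fit(scores)           # robust centre and scatter
    d2 = mcd.mahalanobis(scores)                          # squared robust distances
    return np.where(d2 > chi2.ppf(alpha, df=2))[0]        # indices of flagged curves
```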


Posted Content
TL;DR: In this article, the authors consider general, heterogeneous, and arbitrarily covariant two-dimensional uncertainties, and situations in which there are bad data (large outliers), unknown uncertainties and unknown but expected intrinsic scatter in the linear relationship being fit, and emphasize the importance of having a generative model for the data.
Abstract: We go through the many considerations involved in fitting a model to data, using as an example the fit of a straight line to a set of points in a two-dimensional plane. Standard weighted least-squares fitting is only appropriate when there is a dimension along which the data points have negligible uncertainties, and another along which all the uncertainties can be described by Gaussians of known variance; these conditions are rarely met in practice. We consider cases of general, heterogeneous, and arbitrarily covariant two-dimensional uncertainties, and situations in which there are bad data (large outliers), unknown uncertainties, and unknown but expected intrinsic scatter in the linear relationship being fit. Above all we emphasize the importance of having a "generative model" for the data, even an approximate one. Once there is a generative model, the subsequent fitting is non-arbitrary because the model permits direct computation of the likelihood of the parameters or the posterior probability distribution. Construction of a posterior probability distribution is indispensable if there are "nuisance parameters" to marginalize away.

278 citations
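The generative-model recipe for handling bad data can be written down directly: each point is drawn either from the line or from a broad background (outlier) distribution, and the mixture likelihood is maximized. The sketch below follows that spirit with illustrative parameter names and a crude optimizer; it is not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_line_with_outliers(x, y, sigma_y):
    """Maximum-likelihood straight-line fit with an explicit outlier component.
    sigma_y: known Gaussian uncertainties of y."""
    def nll(theta):
        m, b, logit_p, mu_bg, log_s_bg = theta
        p_out = 1.0 / (1.0 + np.exp(-logit_p))            # outlier fraction in (0, 1)
        s_bg = np.exp(log_s_bg)                           # background spread
        ll_in = norm.logpdf(y, loc=m * x + b, scale=sigma_y)
        ll_out = norm.logpdf(y, loc=mu_bg, scale=np.sqrt(s_bg ** 2 + sigma_y ** 2))
        ll = np.logaddexp(np.log1p(-p_out) + ll_in, np.log(p_out) + ll_out)
        return -ll.sum()
    theta0 = np.array([1.0, 0.0, -2.0, float(np.mean(y)), float(np.log(np.std(y) + 1e-9))])
    res = minimize(nll, theta0, method="Nelder-Mead")
    return res.x   # slope, intercept, logit outlier fraction, background mean / log-sd
```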


Journal ArticleDOI
TL;DR: In this article, a robust rank correlation screening (RRCS) method is proposed to deal with ultra-high dimensional data, based on the Kendall τ correlation coefficient between response and predictor variables rather than the Pearson correlation.
Abstract: Independence screening is a variable selection method that uses a ranking criterion to select significant variables, particularly for statistical models with nonpolynomial dimensionality or "large p, small n" paradigms when p can be as large as an exponential of the sample size n. In this paper we propose a robust rank correlation screening (RRCS) method to deal with ultra-high dimensional data. The new procedure is based on the Kendall τ correlation coefficient between response and predictor variables rather than the Pearson correlation of existing methods. The new method has four desirable features compared with existing independence screening methods. First, the sure independence screening property can hold under only the existence of a second-order moment of the predictor variables, rather than exponential tails or the like, even when the number of predictor variables grows exponentially with the sample size. Second, it can be used to deal with semiparametric models such as transformation regression models and single-index models under a monotonicity constraint on the link function, without involving nonparametric estimation even when there are nonparametric functions in the models. Third, the procedure is robust against outliers and influential points in the observations. Last, the use of indicator functions in rank correlation screening greatly simplifies the theoretical derivation due to the boundedness of the resulting statistics, compared with previous studies on variable screening. Simulations are carried out for comparisons with existing methods and a real data example is analyzed.

265 citations
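The screening step itself is easy to emulate: rank predictors by the absolute Kendall τ correlation with the response and retain the top few. The sketch below shows only this ranking; the paper's theoretical thresholds and the semiparametric extensions are not reproduced, and the default screening size is just a common convention.

```python
import numpy as np
from scipy.stats import kendalltau

def rank_correlation_screening(X, y, d=None):
    """Keep the d predictors with the largest |Kendall tau| correlation with y."""
    n, p = X.shape
    d = d if d is not None else int(n / np.log(n))          # conventional default size
    taus = np.array([kendalltau(X[:, j], y)[0] for j in range(p)])
    keep = np.argsort(-np.abs(taus))[:d]                    # indices of retained predictors
    return keep, taus
```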


Proceedings ArticleDOI
25 Jul 2010
TL;DR: This paper proposes an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers, and applies the model to both synthetic data and DBLP data sets to demonstrate the importance of this concept as well as the effectiveness and efficiency of the proposed approach.
Abstract: Linked or networked data are ubiquitous in many applications. Examples include web data or hypertext documents connected via hyperlinks, social networks or user profiles connected via friend links, co-authorship and citation information, blog data, movie reviews and so on. In these datasets (called "information networks"), closely related objects that share the same properties or interests form a community. For example, a community in the blogosphere could be users mostly interested in cell phone reviews and news. Outlier detection in information networks can reveal important anomalous and interesting behaviors that are not obvious if community information is ignored. An example could be a low-income person being friends with many rich people even though his income is not anomalously low when considered over the entire population. This paper first introduces the concept of community outliers (interesting points or rising stars in a more positive sense), and then shows that well-known baseline approaches that ignore links or community information cannot find these community outliers. We propose an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers. The probabilistic model characterizes both data and links simultaneously by defining their joint distribution based on hidden Markov random fields (HMRF). Maximizing the data likelihood and the posterior of the model gives the solution to the outlier inference problem. We apply the model to both synthetic data and DBLP data sets, and the results demonstrate the importance of this concept, as well as the effectiveness and efficiency of the proposed approach.

Journal ArticleDOI
TL;DR: The PHoto-z Accuracy Testing programme (PHAT) is an international initiative to test and compare different methods of photo-z estimation; the test data sets are publicly available and can be used to compare new, upcoming methods to established ones and help in guiding future photo-z method development.
Abstract: Context. Photometric redshifts (photo-z's) have become an essential tool in extragalactic astronomy. Many current and upcoming observing programmes require great accuracy of photo-z's to reach their scientific goals. Aims. Here we introduce PHAT, the PHoto-z Accuracy Testing programme, an international initiative to test and compare different methods of photo-z estimation. Methods. Two different test environments are set up, one (PHAT0) based on simulations to test the basic functionality of the different photo-z codes, and another one (PHAT1) based on data from the GOODS survey including 18-band photometry and ~2000 spectroscopic redshifts. Results. The accuracy of the different methods is expressed and ranked by the global photo-z bias, scatter, and outlier rates. While most methods agree very well on PHAT0 there are differences in the handling of the Lyman-alpha forest for higher redshifts. Furthermore, different methods produce photo-z scatters that can differ by up to a factor of two even in this idealised case. A larger spread in accuracy is found for PHAT1. Few methods benefit from the addition of mid-IR photometry. The accuracy of the other methods is unaffected or suffers when IRAC data are included. Remaining biases and systematic effects can be explained by shortcomings in the different template sets (especially in the mid-IR) and the use of priors on the one hand and an insufficient training set on the other hand. Some strategies to overcome these problems are identified by comparing the methods in detail. Scatters of 4-8% in Δz/(1 + z) were obtained, consistent with other studies. However, somewhat larger outlier rates (> 7.5% with Δz/(1 + z) > 0.15; > 4.5% after cleaning) are found for all codes that can only partly be explained by AGN or issues in the photometry or the spec-z catalogue. Some outliers were probably missed in comparisons of photo-z's to other, less complete spectroscopic surveys in the past. There is a general trend that empirical codes produce smaller biases than template-based codes. Conclusions. The systematic, quantitative comparison of different photo-z codes presented here is a snapshot of the current state-of-the-art of photo-z estimation and sets a standard for the assessment of photo-z accuracy in the future. The rather large outlier rates reported here for PHAT1 on real data should be investigated further since they are most probably also present (and possibly hidden) in many other studies. The test data sets are publicly available and can be used to compare new, upcoming methods to established ones and help in guiding future photo-z method development.
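The ranking statistics used in PHAT are straightforward to compute from matched photometric and spectroscopic redshifts; the sketch below evaluates the bias, scatter and outlier rate with the |Δz|/(1 + z) > 0.15 outlier definition quoted in the abstract (the exact estimators, e.g. clipped statistics, may differ in the published analysis).

```python
import numpy as np

def photoz_quality(z_phot, z_spec, outlier_cut=0.15):
    """Bias, scatter and outlier rate of dz = (z_phot - z_spec) / (1 + z_spec)."""
    z_phot, z_spec = np.asarray(z_phot, float), np.asarray(z_spec, float)
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    bias = dz.mean()
    scatter = dz.std()
    outlier_rate = np.mean(np.abs(dz) > outlier_cut)
    return bias, scatter, outlier_rate
```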

Journal ArticleDOI
TL;DR: This paper presents a robust mixture modeling framework using the multivariate skew t distributions, an extension of the multivariate Student's t family with additional shape parameters to regulate skewness, which results in a very complicated likelihood.
Abstract: This paper presents a robust mixture modeling framework using the multivariate skew t distributions, an extension of the multivariate Student's t family with additional shape parameters to regulate skewness. The proposed model results in a very complicated likelihood. Two variants of Monte Carlo EM algorithms are developed to carry out maximum likelihood estimation of mixture parameters. In addition, we offer a general information-based method for obtaining the asymptotic covariance matrix of maximum likelihood estimates. Some practical issues including the selection of starting values as well as the stopping criterion are also discussed. The proposed methodology is applied to a subset of the Australian Institute of Sport data for illustration.

Posted Content
TL;DR: A thresholding based iterative procedure for outlier detection (Θ-IPOD) based on hard thresholding correctly identifies outliers on some hard test problems and is much faster than iteratively reweighted least squares for large data, because each iteration costs at most O(np) (and sometimes much less), avoiding an O(np^2) least squares estimate.
Abstract: This paper studies the outlier detection problem from the point of view of penalized regressions. Our regression model adds one mean shift parameter for each of the $n$ data points. We then apply a regularization favoring a sparse vector of mean shift parameters. The usual $L_1$ penalty yields a convex criterion, but we find that it fails to deliver a robust estimator. The $L_1$ penalty corresponds to soft thresholding. We introduce a thresholding (denoted by $\Theta$) based iterative procedure for outlier detection ($\Theta$-IPOD). A version based on hard thresholding correctly identifies outliers on some hard test problems. We find that $\Theta$-IPOD is much faster than iteratively reweighted least squares for large data because each iteration costs at most $O(np)$ (and sometimes much less) avoiding an $O(np^2)$ least squares estimate. We describe the connection between $\Theta$-IPOD and $M$-estimators. Our proposed method has one tuning parameter with which to both identify outliers and estimate regression coefficients. A data-dependent choice can be made based on BIC. The tuned $\Theta$-IPOD shows outstanding performance in identifying outliers in various situations in comparison to other existing approaches. This methodology extends to high-dimensional modeling with $p\gg n$, if both the coefficient vector and the outlier pattern are sparse.
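The alternating structure of the mean-shift model is simple enough to sketch: regress on the shifted response, then hard-threshold the residuals to update the per-observation shifts. The snippet below is an illustrative Θ-IPOD-style iteration with a fixed threshold; the BIC-based tuning and the high-dimensional extension described in the abstract are omitted.

```python
import numpy as np

def ipod_hard_threshold(X, y, lam, n_iter=100):
    """y = X @ beta + gamma + noise, with gamma sparse; nonzero gamma marks outliers."""
    n, p = X.shape
    XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)          # reused every iteration
    gamma = np.zeros(n)
    for _ in range(n_iter):
        beta = XtX_inv_Xt @ (y - gamma)                 # least squares on shifted response
        r = y - X @ beta                                # current residuals
        gamma = np.where(np.abs(r) > lam, r, 0.0)       # hard-thresholding step
    return beta, gamma, np.where(gamma != 0)[0]
```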

Journal ArticleDOI
TL;DR: The results show that the proposed methods outperform standard imputation methods in the presence of outliers, and the model-based method with robust regressions is preferable.

Journal ArticleDOI
TL;DR: In this article, a companion robust R2 estimator is proposed, which is robust to deviations from the specified regression model (like the presence of outliers), is efficient if the errors are normally distributed, and does not make any assumption on the distribution of the explanatory variables (and therefore no assumption on the unconditional distribution of the responses).

Journal ArticleDOI
TL;DR: A two-phase approach to restore images corrupted by blur and impulse noise: the first phase identifies the outlier candidates (the pixels that are likely to be corrupted by impulse noise) and the second phase deblurs and denoises the image with a variational method using the essentially outlier-free data.
Abstract: In this paper, we propose a two-phase approach to restore images corrupted by blur and impulse noise. In the first phase, we identify the outlier candidates--the pixels that are likely to be corrupted by impulse noise. We consider that the remaining data pixels are essentially free of outliers. Then in the second phase, the image is deblurred and denoised simultaneously by a variational method by using the essentially outlier-free data. The experiments show several dB's improvement in PSNR with respect to the typical variational methods.
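The first (detection) phase can be approximated very simply: mark as outlier candidates the pixels that deviate strongly from a local median, and exclude them from the data-fidelity term of the second-phase variational deblurring. The detector below is a crude stand-in with an arbitrary tolerance; the paper relies on more refined impulse-noise detectors such as adaptive median filtering.

```python
import numpy as np
from scipy.ndimage import median_filter

def impulse_candidates(img, window=3, tol=30.0):
    """Return a boolean mask of pixels suspected to be corrupted by impulse noise."""
    img = np.asarray(img, float)
    med = median_filter(img, size=window)        # local median in a window x window patch
    return np.abs(img - med) > tol               # True = candidate outlier pixel
```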

Posted Content
TL;DR: In this paper, a companion robust R2 estimator is proposed, which is robust to deviations from the specified regression model (like the presence of outliers), it is efficient if the errors are normally distributed, and it does not make any assumption on the distribution of the explanatory variables (and therefore no assumption on an unconditional distribution of responses).
Abstract: To assess the quality of the fit in a multiple linear regression, the coefficient of determination or R2 is a very simple tool, yet the most used by practitioners. Indeed, it is reported in most statistical analyses, and although it is not recommended as a final model selection tool, it provides an indication of the suitability of the chosen explanatory variables in predicting the response. In the classical setting, it is well known that the least-squares fit and coefficient of determination can be arbitrary and/or misleading in the presence of a single outlier. In many applied settings, the assumption of normality of the errors and the absence of outliers are difficult to establish. In these cases, robust procedures for estimation and inference in linear regression are available and provide a suitable alternative. In this paper we present a companion robust coefficient of determination that has several desirable properties not shared by others. It is robust to deviations from the specified regression model (like the presence of outliers), it is efficient if the errors are normally distributed, and it does not make any assumption on the distribution of the explanatory variables (and therefore no assumption on the unconditional distribution of the responses). We also show that it is a consistent estimator of the population coefficient of determination. A simulation study and two real datasets support the appropriateness of this estimator, compared with classical (least-squares) and several previously proposed robust R2 estimators, even for small sample sizes.

Journal ArticleDOI
TL;DR: Experimental results demonstrate that this regression scheme achieves higher estimation accuracy and better generalizing ability than Support Vector Regression (SVR).

Journal ArticleDOI
TL;DR: This work compared classical exploratory factor analysis with a robust counterpart which is less influenced by data outliers and data heterogeneities, and revealed that robust exploratory factor analysis is more stable than the classical method.

Posted Content
TL;DR: This work presents an efficient convex optimization-based algorithm that it calls outlier pursuit, which under some mild assumptions on the uncorrupted points recovers the exact optimal low-dimensional subspace and identifies the corrupted points.
Abstract: Singular Value Decomposition (and Principal Component Analysis) is one of the most widely used techniques for dimensionality reduction: successful and efficiently computable, it is nevertheless plagued by a well-known, well-documented sensitivity to outliers. Recent work has considered the setting where each point has a few arbitrarily corrupted components. Yet, in applications of SVD or PCA such as robust collaborative filtering or bioinformatics, malicious agents, defective genes, or simply corrupted or contaminated experiments may effectively yield entire points that are completely corrupted. We present an efficient convex optimization-based algorithm we call Outlier Pursuit, which under some mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the exact optimal low-dimensional subspace and identifies the corrupted points. Such identification of corrupted points that do not conform to the low-dimensional approximation is of paramount interest in bioinformatics and financial applications, and beyond. Our techniques involve matrix decomposition using nuclear norm minimization; however, our results, setup, and approach necessarily differ considerably from the existing line of work in matrix completion and matrix decomposition, since we develop an approach to recover the correct column space of the uncorrupted matrix, rather than the exact matrix itself. In any problem where one seeks to recover a structure rather than the exact initial matrices, techniques developed thus far, relying on certificates of optimality, will fail. We present an important extension of these methods that allows the treatment of such problems.

Journal ArticleDOI
13 Dec 2010
TL;DR: A new method for estimating SWS that considers a solution space of trajectories and evaluates each trajectory using a metric characterizing wave motion along the entire trajectory; the method is suitable for use in situations requiring real-time feedback and is comparably robust to the RANSAC algorithm with respect to outlier data.
Abstract: Time-of-flight methods allow quantitative measurement of shear wave speed (SWS) from ultrasonically tracked displacements following impulsive excitation in tissue. However, application of these methods to in vivo data is challenging because of the presence of gross outlier data resulting from sources such as physiological motion or spatial inhomogeneities. This paper describes a new method for estimating SWS by considering a solution space of trajectories and evaluating each trajectory using a metric that characterizes wave motion along the entire trajectory. The metric used here is found by summing displacement data along the trajectory, as in the calculation of projection data in the Radon transformation. The algorithm is evaluated using data acquired in calibrated phantoms and in vivo human liver. Results are compared with SWS estimates using a random sample consensus (RANSAC) algorithm described by Wang et al. Good agreement is found between the Radon sum and RANSAC SWS estimates, with a correlation coefficient of greater than 0.99 for phantom data and 0.91 for in vivo liver data. The Radon sum transformation is suitable for use in situations requiring real-time feedback and is comparably robust to the RANSAC algorithm with respect to outlier data.
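The Radon-sum idea amounts to scoring straight space-time trajectories by the displacement summed along them and keeping the best one. The grid search below is a bare-bones illustration (nearest-sample lookup, exhaustive candidate speeds and departure times); the published method is more careful about trajectory parameterization and sampling.

```python
import numpy as np

def radon_sum_sws(disp, x, t, speeds):
    """disp: (n_positions, n_times) tracked displacement; x: lateral positions (m);
    t: time samples (s); speeds: candidate shear wave speeds (m/s)."""
    best_speed, best_score = None, -np.inf
    for c in speeds:
        for t0 in t:                               # candidate departure times
            tt = t0 + (x - x[0]) / c               # arrival time at each position
            idx = np.clip(np.searchsorted(t, tt), 0, len(t) - 1)   # nearest following sample
            score = disp[np.arange(len(x)), idx].sum()             # sum along trajectory
            if score > best_score:
                best_speed, best_score = c, score
    return best_speed
```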

Journal ArticleDOI
TL;DR: A novel technique for estimating normals on unorganized point clouds that is capable of dealing with points located in high-curvature regions or near/on complex sharp features, while being highly robust to noise and outliers.

Journal ArticleDOI
TL;DR: Compared to conventional mean regression, quantile regression can characterize the entire conditional distribution of the outcome variable, and is more robust to outliers and misspecification of the error distribution.
Abstract: We study quantile regression (QR) for longitudinal measurements with nonignorable intermittent missing data and dropout. Compared to conventional mean regression, quantile regression can characterize the entire conditional distribution of the outcome variable, and is more robust to outliers and misspecification of the error distribution. We account for the within-subject correlation by introducing an ℓ2 penalty in the usual QR check function to shrink the subject-specific intercepts and slopes toward the common population values. The informative missing data are assumed to be related to the longitudinal outcome process through the shared latent random effects. We assess the performance of the proposed method using simulation studies, and illustrate it with data from a pediatric AIDS clinical trial.
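The shrinkage idea in the check-function formulation can be sketched directly as an objective: the quantile check loss plus an ℓ2 penalty on subject-specific effects. The toy version below penalizes only subject intercepts and uses a generic optimizer; the paper's estimator also shrinks slopes and handles the informative missingness through shared random effects.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_qr(X, y, subject, tau=0.5, lam=1.0):
    """Quantile regression with l2-shrunk subject-specific intercepts (toy version)."""
    subjects = np.unique(subject)                         # sorted unique subject ids
    idx = np.searchsorted(subjects, subject)              # map each row to its subject
    p = X.shape[1]

    def loss(theta):
        beta, b = theta[:p], theta[p:]
        u = y - (X @ beta + b[idx])
        check = np.sum(u * (tau - (u < 0)))               # quantile check function
        return check + lam * np.sum(b ** 2)               # shrink subject intercepts

    theta0 = np.zeros(p + len(subjects))
    res = minimize(loss, theta0, method="Powell")         # derivative-free; loss is nonsmooth
    return res.x[:p], res.x[p:]
```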

Journal ArticleDOI
01 Jun 2010 - Genetics
TL;DR: A general method for applying ABC to Bayesian hierarchical models is developed and applied to detect microsatellite loci influenced by local selection, and it is demonstrated using receiver operating characteristic (ROC) analysis that this approach has comparable performance to a full-likelihood method and outperforms it when mutation rates are variable across loci.
Abstract: We address the problem of finding evidence of natural selection from genetic data, accounting for the confounding effects of demographic history. In the absence of natural selection, gene genealogies should all be sampled from the same underlying distribution, often approximated by a coalescent model. Selection at a particular locus will lead to a modified genealogy, and this motivates a number of recent approaches for detecting the effects of natural selection in the genome as “outliers” under some models. The demographic history of a population affects the sampling distribution of genealogies, and therefore the observed genotypes and the classification of outliers. Since we cannot see genealogies directly, we have to infer them from the observed data under some model of mutation and demography. Thus the accuracy of an outlier-based approach depends to a greater or a lesser extent on the uncertainty about the demographic and mutational model. A natural modeling framework for this type of problem is provided by Bayesian hierarchical models, in which parameters, such as mutation rates and selection coefficients, are allowed to vary across loci. It has proved quite difficult computationally to implement fully probabilistic genealogical models with complex demographies, and this has motivated the development of approximations such as approximate Bayesian computation (ABC). In ABC the data are compressed into summary statistics, and computation of the likelihood function is replaced by simulation of data under the model. In a hierarchical setting one may be interested both in hyperparameters and parameters, and there may be very many of the latter—for example, in a genetic model, these may be parameters describing each of many loci or populations. This poses a problem for ABC in that one then requires summary statistics for each locus, which, if used naively, leads to a consequent difficulty in conditional density estimation. We develop a general method for applying ABC to Bayesian hierarchical models, and we apply it to detect microsatellite loci influenced by local selection. We demonstrate using receiver operating characteristic (ROC) analysis that this approach has comparable performance to a full-likelihood method and outperforms it when mutation rates are variable across loci.
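The ABC ingredient the paper builds on is easy to show in its plain rejection form: draw parameters from the prior, simulate data, and keep the draws whose summary statistics land closest to the observed ones. The generic sampler below takes user-supplied prior and simulator callables; the hierarchical extension and the locus-wise conditional density estimation developed in the paper are not shown.

```python
import numpy as np

def abc_rejection(observed_stats, simulate, prior_sample, n_sims=10000, keep_frac=0.01):
    """Plain ABC rejection: keep the closest keep_frac of simulated parameter draws."""
    observed_stats = np.asarray(observed_stats, float)
    draws, dists = [], []
    for _ in range(n_sims):
        theta = prior_sample()                             # draw parameters from the prior
        s = np.asarray(simulate(theta), float)             # simulate summary statistics
        draws.append(theta)
        dists.append(np.linalg.norm(s - observed_stats))
    dists = np.asarray(dists)
    eps = np.quantile(dists, keep_frac)                    # acceptance tolerance
    return [d for d, dist in zip(draws, dists) if dist <= eps]
```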

Journal ArticleDOI
TL;DR: In this article, the authors developed multivariate outlier tests based on the high-breakdown Minimum Covariance Determinant estimator, which has good performance under the null hypothesis of no outliers in the data and also appreciable power properties for the purpose of individual outlier detection.
Abstract: In this paper we develop multivariate outlier tests based on the high-breakdown Minimum Covariance Determinant estimator. The rules that we propose have good performance under the null hypothesis of no outliers in the data and also appreciable power properties for the purpose of individual outlier detection. This achievement is made possible by two orders of improvement over the currently available methodology. First, we suggest an approximation to the exact distribution of robust distances from which cut-off values can be obtained even in small samples. Our thresholds are accurate, simple to implement and result in more powerful outlier identification rules than those obtained by calibrating the asymptotic distribution of distances. The second power improvement comes from the addition of a new iteration step after one-step reweighting of the estimator. The proposed methodology is motivated by asymptotic distributional results. Its finite sample performance is evaluated through simulations and compared to...
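A baseline version of an MCD-based rule is shown below: compute robust distances from the Minimum Covariance Determinant fit and compare them with an asymptotic chi-square cutoff. The paper's contributions, a small-sample approximation of the distance distribution and an additional iteration after one-step reweighting, are precisely what this naive sketch lacks.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def mcd_outlier_flags(X, alpha=0.025):
    """Flag rows of X whose squared robust distance exceeds the chi2_{p, 1-alpha} cutoff."""
    X = np.asarray(X, float)
    mcd = MinCovDet(random_state=0).fit(X)       # high-breakdown location/scatter estimate
    d2 = mcd.mahalanobis(X)                      # squared robust distances
    return d2 > chi2.ppf(1.0 - alpha, df=X.shape[1])
```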

Journal ArticleDOI
TL;DR: Experimental results of active learning for word sense disambiguation and text classification tasks using six real-world evaluation data sets demonstrate the effectiveness of the proposed methods, sampling by uncertainty and density (SUD) and density-based re-ranking.
Abstract: To solve the knowledge bottleneck problem, active learning has been widely used for its ability to automatically select the most informative unlabeled examples for human annotation. One of the key enabling techniques of active learning is uncertainty sampling, which uses one classifier to identify unlabeled examples with the least confidence. Uncertainty sampling often presents problems when outliers are selected. To solve the outlier problem, this paper presents two techniques, sampling by uncertainty and density (SUD) and density-based re-ranking. Both techniques prefer not only the most informative example in terms of uncertainty criterion, but also the most representative example in terms of density criterion. Experimental results of active learning for word sense disambiguation and text classification tasks using six real-world evaluation data sets demonstrate the effectiveness of the proposed methods.
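The intuition behind sampling by uncertainty and density can be captured in a few lines: weight an uncertainty score (here, prediction entropy) by a density score (here, average similarity to the nearest neighbours), so that uncertain but isolated examples, likely outliers, are not selected. The exact combination and similarity measure in the paper may differ from this illustrative product.

```python
import numpy as np

def sud_scores(probs, X, k=20):
    """probs: (n, n_classes) predicted probabilities for the unlabeled pool;
    X: (n, d) feature vectors. Returns one selection score per example."""
    # uncertainty: entropy of the classifier's predictive distribution
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # density: mean cosine similarity to the k most similar pool examples
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)               # ignore self-similarity
    density = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    return entropy * density                     # query the highest-scoring examples
```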

Journal ArticleDOI
TL;DR: Application of phenix.model_vs_data to the contents of the Protein Data Bank shows that the vast majority of deposited structures can be automatically analyzed to reproduce the reported quality statistics, but the small fraction that eludes automated re-analysis highlights areas where new software developments can help retain valuable information for future analysis.
Abstract: phenix.model_vs_data is a high-level command-line tool for the computation of crystallographic model and data statistics, and the evaluation of the fit of the model to data. Analysis of all Protein Data Bank structures that have experimental data available shows that in most cases the reported statistics, in particular R factors, can be reproduced within a few percentage points. However, there are a number of outliers where the recomputed R values are significantly different from those originally reported. The reasons for these discrepancies are discussed.

Journal ArticleDOI
TL;DR: An outlier detection procedure that applies a nonparametric model accounting for undesired outputs and exogenous influences in the sample, exploiting the leverage, the peer count, the super-efficiency and order-m methods, and the peer index.
Abstract: This paper suggests an outlier detection procedure which applies a nonparametric model accounting for undesired outputs and exogenous influences in the sample. Although efficiency is estimated in a deterministic frontier approach, each potential outlier is initially given the benefit of the doubt of not being an outlier. We survey several outlier detection procedures and select five complementary methodologies which, taken together, are able to detect all influential observations. To exploit the singularity of the leverage and the peer count, the super-efficiency and the order-m method and the peer index, it is proposed to select as outliers those observations which are simultaneously revealed as atypical by at least two of the procedures. A simulated example demonstrates the usefulness of this approach. The model is applied to the Portuguese drinking water sector, for which we have an unusually rich data set.

Book ChapterDOI
01 Apr 2010
TL;DR: This paper proposes a unified framework for combining different outlier detection algorithms that is very effective in detecting outliers in the real-world context compared to other ensemble and individual approaches.
Abstract: Outlier detection has many practical applications, especially in domains that have scope for abnormal behavior. Despite the importance of detecting outliers, defining outliers is in fact a nontrivial task that is normally application-dependent. On the other hand, detection techniques are constructed around the chosen definitions. As a consequence, available detection techniques vary significantly in terms of accuracy, performance and the issues of the detection problem which they address. In this paper, we propose a unified framework for combining different outlier detection algorithms. Unlike existing work, our approach combines non-compatible techniques of different types to improve the outlier detection accuracy compared to other ensemble and individual approaches. Through extensive empirical studies, our framework is shown to be very effective in detecting outliers in the real-world context.
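One concrete (and deliberately simple) way to fuse heterogeneous detectors is rank-based voting: convert each detector's scores to ranks and flag the points placed near the top by several detectors at once. This is only an illustration of the combination idea; the unified framework proposed in the paper is more elaborate.

```python
import numpy as np
from scipy.stats import rankdata

def combine_outlier_detectors(score_lists, min_votes=2, top_fraction=0.05):
    """score_lists: list of 1-D arrays, one outlier score per point from each detector
    (higher = more outlying). Returns indices flagged by at least min_votes detectors."""
    n = len(score_lists[0])
    votes = np.zeros(n, dtype=int)
    for scores in score_lists:
        ranks = rankdata(-np.asarray(scores, float))         # rank 1 = most outlying
        votes += (ranks <= top_fraction * n).astype(int)
    return np.where(votes >= min_votes)[0]
```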