
Showing papers on "Outlier published in 2014"


Journal ArticleDOI
TL;DR: A comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data, novel techniques, and application scenarios in which specific definitions and techniques have been widely used is provided.
Abstract: In the statistics community, outlier detection for time series data has been studied for decades. Recently, with advances in hardware and software technology, there has been a large body of work on temporal outlier detection from a computational perspective within the computer science community. In particular, advances in hardware technology have enabled the availability of various forms of temporal data collection mechanisms, and advances in software technology have enabled a variety of data management mechanisms. This has fueled the growth of different kinds of data sets such as data streams, spatio-temporal data, distributed streams, temporal networks, and time series data, generated by a multitude of applications. There arises a need for an organized and detailed study of the work done in the area of outlier detection with respect to such temporal datasets. In this survey, we provide a comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data, novel techniques, and application scenarios in which specific definitions and techniques have been widely used.

851 citations


Journal ArticleDOI
TL;DR: An open-source application, called BoxPlotR, and an associated web portal that allow rapid generation of customized box plots, which represent both the summary statistics and the distribution of the primary data in biomedical research.
Abstract: To the Editor: In biomedical research, it is often necessary to compare multiple data sets with different distributions. The bar plot, or histogram, is typically used to compare data sets on the basis of simple statistical measures, usually the mean with s.d. or s.e.m. However, summary statistics alone may fail to convey underlying differences in the structure of the primary data (Fig. 1a), which may in turn lead to erroneous conclusions. The box plot, also known as the box-and-whisker plot, represents both the summary statistics and the distribution of the primary data. The box plot thus enables visualization of the minimum, lower quartile, median, upper quartile and maximum of any data set (Fig. 1b). The first documented description of a box plot–like graph by Spear1 defined a range bar to show the median and interquartile range (IQR, or middle 50%) of a data set, with whiskers extended to minimum and maximum values. The most common implementation of the box plot, as defined by Tukey2, has a box that represents the IQR, with whiskers that extend 1.5 times the IQR from the box edges; it also allows for identification of outliers in the data set. Whiskers can also be defined to span the 95% central range of the data3. Other variations, including bean plots4 and violin plots, reveal additional details of the data distribution. These latter variants are less statistically informative but allow better visualization of the data distribution, such as bimodality (Fig. 1b), that may be hidden in a standard box plot. (Figure 1: data visualization with box plots.) Despite the obvious advantages of the box plot for simultaneous representation of data set and statistical parameters, this method is not in common use, in part because few available software tools allow the facile generation of box plots. For example, the standard spreadsheet tool Excel is unable to generate box plots. Here we describe an open-source application, called BoxPlotR, and an associated web portal that allow rapid generation of customized box plots. A user-defined data matrix is uploaded as a file or pasted directly into the application to generate a basic box plot with options for additional features. Sample size may be represented by the width of each box in proportion to the square root of the number of observations5. Whiskers may be defined according to the criteria of Spear1, Tukey2 or Altman3. The underlying data distribution may be visualized as a violin or bean plot or, alternatively, the actual data may be displayed as overlapping or nonoverlapping points. The 95% confidence interval that two medians are different may be illustrated as notches defined as ±(1.58 × IQR/√n) (ref. 5). There is also an option to plot the sample means and their confidence intervals. More complex statistical comparisons may be required to ascertain significance according to the specific experimental design6. The output plots may be labeled; customized by color, dimensions and orientation; and exported as publication-quality .eps, .pdf or .svg files. To help ensure that generated plots are accurately described in publications, the application generates a description of the plot for incorporation into a figure legend. The interactive web application is written in R (ref. 7) with the R packages shiny, beanplot4, vioplot, beeswarm and RColorBrewer, and it is hosted on a shiny server to allow for interactive data analysis. User data are held only temporarily and discarded as soon as the session terminates.
BoxPlotR is available at http://boxplot.tyerslab.com/ and may be downloaded to run locally or as a virtual machine for VMware and VirtualBox.

633 citations
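As a rough illustration of the box-plot conventions described in the BoxPlotR entry above, the following minimal Python sketch (numpy only; all names are illustrative, not taken from BoxPlotR) computes the Tukey statistics: the IQR box, whiskers at 1.5 times the IQR, flagged outliers, and the notch half-width ±(1.58 × IQR/√n).

import numpy as np

def tukey_box_stats(x):
    """Summary statistics for a Tukey-style box plot of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = x[(x < lo_fence) | (x > hi_fence)]
    # Whiskers extend to the most extreme data points inside the fences.
    inside = x[(x >= lo_fence) & (x <= hi_fence)]
    notch = 1.58 * iqr / np.sqrt(len(x))   # 95% CI half-width for the median
    return {
        "median": median, "q1": q1, "q3": q3,
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "notch_low": median - notch, "notch_high": median + notch,
        "outliers": outliers,
    }

rng = np.random.default_rng(0)
print(tukey_box_stats(rng.normal(size=200)))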


Journal ArticleDOI
TL;DR: It is concluded that in species that exhibit IBD or have undergone range expansion, many of the published FST outliers based on FDIST2 and BayeScan are probably false positives, but FLK and Bayenv2 show great promise for accurately identifying loci under spatially divergent selection.
Abstract: FST outlier tests are a potentially powerful way to detect genetic loci under spatially divergent selection. Unfortunately, the extent to which these tests are robust to nonequilibrium demographic histories has been understudied. We developed a landscape genetics simulator to test the effects of isolation by distance (IBD) and range expansion on FST outlier methods. We evaluated the two most commonly used methods for the identification of FST outliers (FDIST2 and BayeScan, which assume samples are evolutionarily independent) and two recent methods (FLK and Bayenv2, which estimate and account for evolutionary nonindependence). Parameterization with a set of neutral loci (‘neutral parameterization’) always improved the performance of FLK and Bayenv2, while neutral parameterization caused FDIST2 to actually perform worse in the cases of IBD or range expansion. BayeScan was improved when the prior odds on neutrality were increased, regardless of the true odds in the data. Even at their best performance, however, the widely used methods had high false-positive rates for IBD and range expansion and were outperformed by methods that accounted for evolutionary nonindependence. In addition, default settings in FDIST2 and BayeScan resulted in many false positives suggesting balancing selection. However, all methods did very well if a large set of neutral loci was available to create empirical P-values. We conclude that in species that exhibit IBD or have undergone range expansion, many of the published FST outliers based on FDIST2 and BayeScan are probably false positives, but FLK and Bayenv2 show great promise for accurately identifying loci under spatially divergent selection.

493 citations
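The closing observation in the entry above, that a large set of neutral loci can be used to create empirical P-values, can be sketched in a few lines of Python (illustrative code, not taken from FDIST2, BayeScan, FLK or Bayenv2): each candidate locus's FST value is ranked against the FST distribution of putatively neutral loci.

import numpy as np

def empirical_p_values(candidate_fst, neutral_fst):
    """One-sided empirical P-values for unusually high FST,
    using neutral loci as the null distribution."""
    neutral = np.sort(np.asarray(neutral_fst, dtype=float))
    n = len(neutral)
    # Fraction of neutral loci with FST >= the candidate value
    # (the +1 terms avoid P = 0 for values beyond the neutral range).
    ranks = np.searchsorted(neutral, candidate_fst, side="left")
    return (n - ranks + 1) / (n + 1)

neutral = np.random.default_rng(1).beta(2, 20, size=10_000)   # stand-in neutral FST values
candidates = np.array([0.05, 0.20, 0.45])
print(empirical_p_values(candidates, neutral))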


Journal ArticleDOI
TL;DR: It is found that it is possible to greatly reduce error rates by considering the results of all three methods when identifying outlier loci, and the relative ranking between the methods is impacted by the consideration of polygenic selection.
Abstract: The recent availability of next-generation sequencing (NGS) has made possible the use of dense genetic markers to identify regions of the genome that may be under the influence of selection. Several statistical methods have been developed recently for this purpose. Here, we present the results of an individual-based simulation study investigating the power and error rate of popular or recent genome scan methods: linear regression, Bayescan, BayEnv and LFMM. In contrast to previous studies, we focus on complex, hierarchical population structure and on polygenic selection. Additionally, we use a false discovery rate (FDR)-based framework, which provides a unified testing framework across frequentist and Bayesian methods. Finally, we investigate the influence of population allele frequencies versus individual genotype data specification for LFMM and the linear regression. The relative ranking between the methods is impacted by the consideration of polygenic selection, compared to a monogenic scenario. For strongly hierarchical scenarios with confounding effects between demography and environmental variables, the power of the methods can be very low. Except for one scenario, Bayescan exhibited moderate power and error rate. BayEnv performance was good under nonhierarchical scenarios, while LFMM provided the best compromise between power and error rate across scenarios. We found that it is possible to greatly reduce error rates by considering the results of all three methods when identifying outlier loci.

288 citations


Journal ArticleDOI
TL;DR: A half-quadratic (HQ) framework to solve the robust sparse representation problem is developed, and it is shown that the ℓ1-regularization solved by the soft-thresholding function has a dual relationship to the Huber M-estimator, which theoretically guarantees the performance of robust sparse representation in terms of M-estimation.
Abstract: Robust sparse representation has shown significant potential in solving challenging problems in computer vision such as biometrics and visual surveillance. Although several robust sparse models have been proposed and promising results have been obtained, they are either for error correction or for error detection, and learning a general framework that systematically unifies these two aspects and explores their relation is still an open problem. In this paper, we develop a half-quadratic (HQ) framework to solve the robust sparse representation problem. By defining different kinds of half-quadratic functions, the proposed HQ framework is applicable to performing both error correction and error detection. More specifically, by using the additive form of HQ, we propose an l1-regularized error correction method by iteratively recovering corrupted data from errors incurred by noises and outliers; by using the multiplicative form of HQ, we propose an l1-regularized error detection method by learning from uncorrupted data iteratively. We also show that the l1-regularization solved by soft-thresholding function has a dual relationship to Huber M-estimator, which theoretically guarantees the performance of robust sparse representation in terms of M-estimation. Experiments on robust face recognition under severe occlusion and corruption validate our framework and findings.

257 citations
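The duality mentioned in the entry above connects two simple building blocks; the sketch below (generic numpy, not the authors' code) shows the soft-thresholding operator used for l1-regularized residuals and the multiplicative Huber weight used in half-quadratic / IRLS iterations.

import numpy as np

def soft_threshold(r, lam):
    """Proximal operator of lam * ||.||_1, applied elementwise to residuals r."""
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def huber_weight(r, delta):
    """Multiplicative HQ / IRLS weight of the Huber M-estimator:
    1 inside [-delta, delta], delta/|r| outside."""
    r = np.asarray(r, dtype=float)
    w = np.ones_like(r)
    big = np.abs(r) > delta
    w[big] = delta / np.abs(r[big])
    return w

r = np.array([-3.0, -0.2, 0.1, 2.5])
print(soft_threshold(r, 0.5))   # shrinks small residuals to zero
print(huber_weight(r, 1.0))     # down-weights large residuals (outliers)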


Journal ArticleDOI
TL;DR: The core ingredients for building an outlier ensemble are focused on, the first steps taken in the literature are discussed, and challenges for future research are identified.
Abstract: Ensembles for unsupervised outlier detection is an emerging topic that has been neglected for a surprisingly long time (although there are reasons why this is more difficult than supervised ensembles or even clustering ensembles). Aggarwal recently discussed algorithmic patterns of outlier detection ensembles, identified traces of the idea in the literature, and remarked on potential as well as unlikely avenues for future transfer of concepts from supervised ensembles. Complementary to his points, here we focus on the core ingredients for building an outlier ensemble, discuss the first steps taken in the literature, and identify challenges for future research.

250 citations
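One core ingredient of outlier ensembles discussed in the entry above is making scores from heterogeneous detectors comparable before combining them. A minimal sketch of one such scheme (generic rank normalization followed by averaging, not a specific proposal from the paper):

import numpy as np

def ensemble_outlier_scores(score_matrix):
    """Combine outlier scores from several detectors.
    score_matrix: shape (n_detectors, n_points); larger = more outlying.
    Each detector's scores are rank-normalized to [0, 1] before averaging,
    so detectors with different scales contribute comparably."""
    s = np.asarray(score_matrix, dtype=float)
    ranks = s.argsort(axis=1).argsort(axis=1)   # per-detector ranks
    normalized = ranks / (s.shape[1] - 1)
    return normalized.mean(axis=0)              # consensus score per point

scores = np.array([[0.1, 0.2, 9.0, 0.3],     # detector A (arbitrary scale)
                   [5.0, 4.0, 90.0, 6.0]])   # detector B (different scale)
print(ensemble_outlier_scores(scores))        # the third point stands out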


Journal ArticleDOI
TL;DR: The proposed mode ensemble operator is found to produce the most accurate forecasts, followed by the median, while the mean has relatively poor performance, suggesting that the mode operator should be considered as an alternative to the mean and median operators in forecasting applications.
Abstract: The combination of forecasts resulting from an ensemble of neural networks has been shown to outperform the use of a single "best" network model. This is supported by an extensive body of literature, which shows that combining generally leads to improvements in forecasting accuracy and robustness, and that using the mean operator often outperforms more complex methods of combining forecasts. This paper proposes a mode ensemble operator based on kernel density estimation, which unlike the mean operator is insensitive to outliers and deviations from normality, and unlike the median operator does not require symmetric distributions. The three operators are compared empirically and the proposed mode ensemble operator is found to produce the most accurate forecasts, followed by the median, while the mean has relatively poor performance. The findings suggest that the mode operator should be considered as an alternative to the mean and median operators in forecasting applications. Experiments indicate that mode ensembles are useful in automating neural network models across a large number of time series, overcoming issues of uncertainty associated with data sampling, the stochasticity of neural network training, and the distribution of the forecasts.

230 citations
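A small Python sketch (scipy's Gaussian KDE; not the paper's implementation) contrasting the three combination operators from the entry above for a single forecast horizon, with the mode taken as the maximizer of a kernel density estimate over the ensemble members:

import numpy as np
from scipy.stats import gaussian_kde

def combine_forecasts(forecasts):
    """Combine an ensemble of point forecasts for a single horizon.
    The mode is the maximizer of a Gaussian kernel density estimate,
    which is insensitive to outlying ensemble members."""
    f = np.asarray(forecasts, dtype=float)
    kde = gaussian_kde(f)
    grid = np.linspace(f.min(), f.max(), 1000)
    mode = grid[np.argmax(kde(grid))]
    return {"mean": f.mean(), "median": np.median(f), "mode": mode}

# 20 network forecasts, one of which is a gross outlier
forecasts = np.r_[np.random.default_rng(2).normal(10.0, 0.5, size=19), 50.0]
print(combine_forecasts(forecasts))   # the mean is dragged upward; the median and mode are not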


Proceedings ArticleDOI
24 Aug 2014
TL;DR: This work introduces a novel user-oriented approach to infer user preference by the so-called focus attributes through a set of user-provided exemplar nodes and shows the effectiveness and scalability of the method on synthetic and real-world graphs, as compared to both existing graph clustering and outlier detection approaches.
Abstract: Graph clustering and graph outlier detection have been studied extensively on plain graphs, with various applications. Recently, algorithms have been extended to graphs with attributes as often observed in the real-world. However, all of these techniques fail to incorporate the user preference into graph mining, and thus, lack the ability to steer algorithms to more interesting parts of the attributed graph. In this work, we overcome this limitation and introduce a novel user-oriented approach for mining attributed graphs. The key aspect of our approach is to infer user preference by the so-called focus attributes through a set of user-provided exemplar nodes. In this new problem setting, clusters and outliers are then simultaneously mined according to this user preference. Specifically, our FocusCO algorithm identifies the focus, extracts focused clusters and detects outliers. Moreover, FocusCO scales well with graph size, since we perform a local clustering of interest to the user rather than global partitioning of the entire graph. We show the effectiveness and scalability of our method on synthetic and real-world graphs, as compared to both existing graph clustering and outlier detection approaches.

222 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: Since the outlier removal process is based on an algebraic criterion which does not require computing the full pose and reprojecting all 3D points back on the image plane at each step, the solution achieves speed gains of more than 100× compared to RANSAC strategies.
Abstract: We propose a real-time, outlier-robust and accurate solution to the Perspective-n-Point (PnP) problem. The main advantages of our solution are twofold: first, it integrates the outlier rejection within the pose estimation pipeline with a negligible computational overhead, and second, it scales to an arbitrarily large number of correspondences. Given a set of 3D-to-2D matches, we formulate the pose estimation problem as a low-rank homogeneous system where the solution lies on its 1D null space. Outlier correspondences are those rows of the linear system which perturb the null space and are progressively detected by projecting them on an iteratively estimated solution of the null space. Since our outlier removal process is based on an algebraic criterion which does not require computing the full pose and reprojecting all 3D points back on the image plane at each step, we achieve speed gains of more than 100× compared to RANSAC strategies. An extensive experimental evaluation shows that our solution yields accurate results in situations with up to 50% of outliers, and can process more than 1000 correspondences in less than 5 ms.

161 citations


Journal ArticleDOI
TL;DR: In this article, a robust Kalman filter scheme is proposed to resist the influence of the outliers in the observations, where a judging index is defined as the square of the Mahalanobis distance from the observation to its prediction.
Abstract: A robust Kalman filter scheme is proposed to resist the influence of outliers in the observations. Two kinds of observation error are studied, i.e., outliers in the actual observations and a heavy-tailed distribution of the observation noise. Either of the two kinds of errors can seriously degrade the performance of the standard Kalman filter. In the proposed method, a judging index is defined as the square of the Mahalanobis distance from the observation to its prediction. By assuming that the observation is Gaussian distributed with the mean and covariance being the observation prediction and its associated covariance, the judging index should be chi-square distributed with the dimension of the observation vector as the degrees of freedom. A hypothesis test is performed on the actual observation by treating the above Gaussian distribution as the null hypothesis and the judging index as the test statistic. If the null hypothesis is rejected, it is concluded that outliers exist in the observations. In the presence of outliers, scaling factors can be introduced to rescale the covariance of the observation noise or of the innovation vector, both resulting in a decreased filter gain. The scaling factors can be obtained using Newton's iterative method or in an analytical manner. The harmful influence of either of the two kinds of errors can be effectively resisted in the proposed method, so robustness is achieved. Moreover, as the number of iterations needed in the iterative method may be rather large, the analytically calculated scaling factor should be preferred.

159 citations
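A compact sketch of the judging index described in the entry above (numpy/scipy, hypothetical variable names): the squared Mahalanobis distance of the innovation is tested against a chi-square threshold, and when the test rejects, the innovation covariance is inflated by an analytically computed scaling factor, which reduces the filter gain. This is a generic illustration of the idea, not the paper's exact scheme.

import numpy as np
from scipy.stats import chi2

def robust_update(x, P, z, H, R, alpha=0.01):
    """One measurement update of a Kalman filter with chi-square outlier gating.
    x, P: prior state and covariance; z: observation; H, R: observation model."""
    innov = z - H @ x
    S = H @ P @ H.T + R                       # innovation covariance
    d2 = innov @ np.linalg.solve(S, innov)    # judging index (squared Mahalanobis distance)
    thresh = chi2.ppf(1 - alpha, df=len(z))
    if d2 > thresh:
        S = S * (d2 / thresh)                 # analytical scaling factor -> smaller gain
    K = P @ H.T @ np.linalg.inv(S)
    x_new = x + K @ innov
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

x0, P0 = np.zeros(2), np.eye(2)
H, R = np.eye(2), 0.1 * np.eye(2)
print(robust_update(x0, P0, np.array([0.1, 8.0]), H, R))   # second component is an outlier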


Journal ArticleDOI
TL;DR: In this article, the authors describe an observation-driven model, based on an exponential generalized beta distribution of the second kind (EGB2), in which the signal is a linear function of past values of the score of the conditional distribution.
Abstract: A time series model in which the signal is buried in non-Gaussian noise may throw up observations that are outliers when judged by the Gaussian yardstick. We describe an observation-driven model, based on an exponential generalized beta distribution of the second kind (EGB2), in which the signal is a linear function of past values of the score of the conditional distribution. This specification produces a model that is not only easy to implement, but that also facilitates the development of a comprehensive and relatively straightforward theory for the asymptotic distribution of the maximum likelihood estimator. The model is fitted to US macroeconomic time series and compared with Gaussian and Student-t models. A theory is then developed for an EGARCH model based on the EGB2 distribution and the model is fitted to exchange rate data. Finally, dynamic location and scale models are combined and applied to data on the UK rate of inflation.

Proceedings Article
01 Jan 2014
TL;DR: A generalization of density-based outlier detection methods based on kernel density estimation that allows for the integration of domain knowledge and specific requirements and demonstrates the flexible applicability and scalability of the method on large real world data sets.
Abstract: We analyse the interplay of density estimation and outlier detection in density-based outlier detection. By clear and principled decoupling of both steps, we formulate a generalization of density-based outlier detection methods based on kernel density estimation. Embedded in a broader framework for outlier detection, the resulting method can be easily adapted to detect novel types of outliers: while common outlier detection methods are designed for detecting objects in sparse areas of the data set, our method can be modified to also detect unusual local concentrations or trends in the data set if desired. It allows for the integration of domain knowledge and specific requirements. We demonstrate the flexible applicability and scalability of the method on large real world data sets.
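The decoupling described above, first estimate a density and then score points by it, can be illustrated with a short sketch using scikit-learn's KernelDensity (a generic illustration, not the authors' framework):

import numpy as np
from sklearn.neighbors import KernelDensity

def kde_outlier_scores(X, bandwidth=0.5):
    """Step 1: density estimation; step 2: outlier scoring.
    Lower estimated density -> higher outlier score (negative log-density)."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X)
    log_density = kde.score_samples(X)
    return -log_density

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),    # dense cluster
               np.array([[6.0, 6.0]])])            # isolated point
scores = kde_outlier_scores(X)
print(np.argmax(scores))   # index 200: the isolated point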

Book
01 Apr 2014
TL;DR: In this article, the authors present an organized picture of both recent and past research in temporal outlier detection, highlight the importance of temporal data, and provide a taxonomy of proposed techniques.
Abstract: Outlier (or anomaly) detection is a very broad field which has been studied in the context of a large number of research areas like statistics, data mining, sensor networks, environmental science, distributed systems, spatio-temporal mining, etc. Initial research in outlier detection focused on time series-based outliers (in statistics). Since then, outlier detection has been studied on a large variety of data types including high-dimensional data, uncertain data, stream data, network data, time series data, spatial data, and spatio-temporal data. While there have been many tutorials and surveys for general outlier detection, we focus on outlier detection for temporal data in this book. A large number of applications generate temporal datasets. For example, in our everyday life, various kinds of records like credit, personnel, financial, judicial, medical, etc., are all temporal. This stresses the need for an organized and detailed study of outliers with respect to such temporal data. In the past decade, there has been a lot of research on various forms of temporal data including consecutive data snapshots, series of data snapshots and data streams. Besides the initial work on time series, researchers have focused on rich forms of data including multiple data streams, spatio-temporal data, network data, community distribution data, etc. Compared to general outlier detection, techniques for temporal outlier detection are very different. In this book, we will present an organized picture of both recent and past research in temporal outlier detection. We start with the basics and then ramp up the reader to the main ideas in state-of-the-art outlier detection techniques. We motivate the importance of temporal outlier detection and briefly describe the challenges beyond those of general outlier detection. Then, we present a taxonomy of proposed techniques for temporal outlier detection. Such techniques broadly include statistical techniques (like AR models, Markov models, histograms, neural networks), distance- and density-based approaches, grouping-based approaches (clustering, community detection), network-based approaches, and spatio-temporal outlier detection approaches. We summarize by presenting a wide collection of applications where temporal outlier detection techniques have been applied to discover interesting outliers.

Journal ArticleDOI
TL;DR: This work uses robust M-estimation techniques to limit the influence of outliers, more specifically a modified version of the iterative closest point algorithm in which iteratively reweighted least squares is used to incorporate the robustness.
Abstract: Registration of point sets is done by finding a rotation and translation that produce a best fit between a set of data points and a set of model points. We use robust M-estimation techniques to limit the influence of outliers, more specifically a modified version of the iterative closest point algorithm in which we use iteratively reweighted least squares to incorporate the robustness. We prove convergence with respect to the value of the objective function for this algorithm. We also compare different criterion functions with respect to their ability to produce appropriate point set fits when the set of data points contains outliers. The robust methods prove to be superior to least squares minimization in this setting.
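A short numpy sketch of the reweighting idea (illustrative, not the authors' implementation): residuals under the current alignment produce Huber-type weights, and a weighted rigid (Kabsch) fit then down-weights outlying correspondences. Correspondences are assumed fixed here; a full ICP would re-estimate them between iterations.

import numpy as np

def weighted_rigid_fit(P, Q, w):
    """Rotation R and translation t minimizing sum_i w_i ||R p_i + t - q_i||^2 (Kabsch)."""
    w = w / w.sum()
    p_bar, q_bar = w @ P, w @ Q
    H = (P - p_bar).T @ ((Q - q_bar) * w[:, None])
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0] * (P.shape[1] - 1) + [d]) @ U.T
    return R, q_bar - R @ p_bar

def irls_register(P, Q, delta=1.0, n_iter=20):
    """Robust registration of corresponding point sets P -> Q via IRLS with Huber weights."""
    R, t = np.eye(P.shape[1]), np.zeros(P.shape[1])
    for _ in range(n_iter):
        resid = np.linalg.norm(P @ R.T + t - Q, axis=1)
        w = np.where(resid > delta, delta / np.maximum(resid, 1e-12), 1.0)  # Huber weights
        R, t = weighted_rigid_fit(P, Q, w)
    return R, t

rng = np.random.default_rng(4)
P = rng.normal(size=(60, 3))
angle = 0.4
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
Q = P @ R_true.T + np.array([1.0, -2.0, 0.5])
Q[:5] += 8.0                                  # a few grossly wrong correspondences
R_est, t_est = irls_register(P, Q)
print(np.round(t_est, 2))                     # close to [1.0, -2.0, 0.5] despite the outliers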

Journal ArticleDOI
TL;DR: In this paper, an observation-driven model, based on a conditional Student's t-distribution, was proposed for rail travel in the United Kingdom, which is tractable and retains some of the desirable features of the linear Gaussian model.
Abstract: An unobserved components model in which the signal is buried in noise that is non-Gaussian may throw up observations that, when judged by the Gaussian yardstick, are outliers. We describe an observation-driven model, based on a conditional Student’s t-distribution, which is tractable and retains some of the desirable features of the linear Gaussian model. Letting the dynamics be driven by the score of the conditional distribution leads to a specification that is not only easy to implement, but which also facilitates the development of a comprehensive and relatively straightforward theory for the asymptotic distribution of the maximum likelihood estimator. The methods are illustrated with an application to rail travel in the United Kingdom. The final part of the article shows how the model may be extended to include explanatory variables.

Journal ArticleDOI
TL;DR: This paper proposes a new framework, termed robust low-rank representation, which is more robust to various noises (illumination, occlusion, etc) than LRR, and also outperforms other state-of-the-art methods.
Abstract: Recently the low-rank representation (LRR) has been successfully used in exploring the multiple subspace structures of data. It assumes that the observed data is drawn from several low-rank subspaces and sometimes contaminated by outliers and occlusions. However, the noise (low-rank representation residual) is assumed to be sparse, which is generally characterized by minimizing the l1 -norm of the residual. This actually assumes that the residual follows the Laplacian distribution. The Laplacian assumption, however, may not be accurate enough to describe various noises in real scenarios. In this paper, we propose a new framework, termed robust low-rank representation, by considering the low-rank representation as a low-rank constrained estimation for the errors in the observed data. This framework aims to find the maximum likelihood estimation solution of the low-rank representation residuals. We present an efficient iteratively reweighted inexact augmented Lagrange multiplier algorithm to solve the new problem. Extensive experimental results show that our framework is more robust to various noises (illumination, occlusion, etc) than LRR, and also outperforms other state-of-the-art methods.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This paper presents a novel probability continuous outlier model (PCOM) to depict the continuous outliers that occur in the linear representation model and designs an effective observation likelihood function and a simple update scheme for visual tracking.
Abstract: In this paper, we present a novel online visual tracking method based on linear representation. First, we present a novel probability continuous outlier model (PCOM) to depict the continuous outliers that occur in the linear representation model. In the proposed model, each element of the noisy observation sample can either be represented by a PCA subspace with small Gaussian noise or treated as an arbitrary value with a uniform prior, in which the spatial consistency prior is exploited by using a binary Markov random field model. Then, we derive the objective function of the PCOM method, the solution of which can be iteratively obtained by the outlier-free least squares and standard max-flow/min-cut steps. Finally, based on the proposed PCOM method, we design an effective observation likelihood function and a simple update scheme for visual tracking. Both qualitative and quantitative evaluations demonstrate that our tracker achieves very favorable performance in terms of both accuracy and speed.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a dual data-driven PCA/SIMCA (DD-SIMCA) approach to construct a two-level decision area with extreme and outlier thresholds, both in the case of a regular data set and in the presence of outliers.
Abstract: For the construction of a reliable decision area in the soft independent modeling by class analogy (SIMCA) method, it is necessary to analyze calibration data revealing the objects of special types such as extremes and outliers. For this purpose, a thorough statistical analysis of the scores and orthogonal distances is necessary. The distance values should be considered as any data acquired in the experiment, and their distributions are estimated by a data-driven method, such as a method of moments or similar. The scaled chi-squared distribution seems to be the first candidate among the others in such an assessment. This provides the possibility of constructing a two-level decision area, with the extreme and outlier thresholds, both in case of regular data set and in the presence of outliers. We suggest the application of classical principal component analysis (PCA) with further use of enhanced robust estimators both for the scaling factor and for the number of degrees of freedom. A special diagnostic tool called extreme plot is proposed for the analyses of calibration objects. Extreme objects play an important role in data analysis. These objects are a mandatory attribute of any data set. The advocated dual data-driven PCA/SIMCA (DD-SIMCA) approach has demonstrated a proper performance in the analysis of simulated and real-world data for both regular and contaminated cases. DD-SIMCA has also been compared with robust principal component analysis, which is a fully robust method. Copyright © 2013 John Wiley & Sons, Ltd.

Proceedings ArticleDOI
01 Mar 2014
TL;DR: This work proposes the first general framework to handle the three major classes of distance-based outliers in streaming environments, including the traditional distance-threshold based and the nearest-neighbor-based definitions, and designs an outlier detection strategy proven to be optimal in CPU costs.
Abstract: The discovery of distance-based outliers from huge volumes of streaming data is critical for modern applications ranging from credit card fraud detection to moving object monitoring. In this work, we propose the first general framework to handle the three major classes of distance-based outliers in streaming environments, including the traditional distance-threshold based and the nearest-neighbor-based definitions. Our LEAP framework encompasses two general optimization principles applicable across all three outlier types. First, our “minimal probing” principle uses a lightweight probing operation to gather minimal yet sufficient evidence for outlier detection. This principle overturns the state-of-the-art methodology that requires routinely conducting expensive complete neighborhood searches to identify outliers. Second, our “lifespan-aware prioritization” principle leverages the temporal relationships among stream data points to prioritize the processing order among them during the probing process. Guided by these two principles, we design an outlier detection strategy which is proven to be optimal in CPU costs needed to determine the outlier status of any data point during its entire life. Our comprehensive experimental studies, using both synthetic as well as real streaming data, demonstrate that our methods are 3 orders of magnitude faster than state-of-the-art methods for a rich diversity of scenarios tested yet scale to high dimensional streaming data.
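The "minimal probing" principle can be illustrated on the classic distance-threshold definition, under which a point is an inlier if at least k other points lie within radius r. The sketch below (generic Python, not the LEAP framework itself) stops probing as soon as enough neighbor evidence is found, instead of always computing the full neighborhood:

import numpy as np

def is_inlier_minimal_probing(point, data, r, k):
    """Distance-threshold test with early termination: probe points one by one and
    stop as soon as k neighbors within radius r are found."""
    found = 0
    for other in data:
        if np.linalg.norm(point - other) <= r:
            found += 1
            if found >= k:
                return True          # enough evidence: no need to probe further
    return False

rng = np.random.default_rng(5)
window = rng.normal(size=(1000, 2))            # current sliding-window contents
print(is_inlier_minimal_probing(np.array([0.0, 0.0]), window, r=0.5, k=10))  # True
print(is_inlier_minimal_probing(np.array([8.0, 8.0]), window, r=0.5, k=10))  # False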

Journal ArticleDOI
TL;DR: In this paper, shape outliers are defined as those curves that exhibit a different shape from the rest of the data and are thus difficult to detect, whereas magnitude outliers, that is, curves that lie outside the range of the majority of the data, are in general easy to identify.
Abstract: We propose a new method to visualize and detect shape outliers in samples of curves. In functional data analysis, we observe curves defined over a given real interval, and shape outliers may be defined as those curves that exhibit a different shape from the rest of the sample. Whereas magnitude outliers, that is, curves that lie outside the range of the majority of the data, are in general easy to identify, shape outliers are often masked among the rest of the curves and thus difficult to detect. In this article, we exploit the relationship between two measures of depth for functional data to help to visualize curves in terms of shape and to develop an algorithm for shape outlier detection. We illustrate the use of the visualization tool, the outliergram, through several examples and analyze the performance of the algorithm in a simulation study. Finally, we apply our method to assess cluster quality in a real set of time course microarray data.

Journal ArticleDOI
TL;DR: In a model of population divergence, the hierarchical Bayesian factor model can achieve a 2-fold or more reduction of false discovery rate compared with the software BayeScan or with an FST approach and can handle large data sets by analyzing the single nucleotide polymorphisms of the Human Genome Diversity Project.
Abstract: There is a considerable impetus in population genomics to pinpoint loci involved in local adaptation. A powerful approach to find genomic regions subject to local adaptation is to genotype numerous molecular markers and look for outlier loci. One of the most common approaches for selection scans is based on statistics that measure population differentiation such as FST. However, there are important caveats with approaches related to FST because they require grouping individuals into populations and they additionally assume a particular model of population structure. Here, we implement a more flexible individual-based approach based on Bayesian factor models. Factor models capture population structure with latent variables called factors, which can describe clustering of individuals into populations or isolation-by-distance patterns. Using hierarchical Bayesian modeling, we both infer population structure and identify outlier loci that are candidates for local adaptation. In order to identify outlier loci, the hierarchical factor model searches for loci that are atypically related to population structure as measured by the latent factors. In a model of population divergence, we show that it can achieve a 2-fold or more reduction of false discovery rate compared with the software BayeScan or with an FST approach. We show that our software can handle large data sets by analyzing the single nucleotide polymorphisms of the Human Genome Diversity Project. The Bayesian factor model is implemented in the open-source PCAdapt software.

Journal ArticleDOI
TL;DR: In this article, the handling of outliers in the context of independent samples t tests applied to nonnormal sum scores is discussed, and it is shown that removing outliers based on commonly used Z value thresholds severely increases the Type I error rate.
Abstract: In psychology, outliers are often excluded before running an independent samples t test, and data are often nonnormal because of the use of sum scores based on tests and questionnaires. This article concerns the handling of outliers in the context of independent samples t tests applied to nonnormal sum scores. After reviewing common practice, we present results of simulations of artificial and actual psychological data, which show that the removal of outliers based on commonly used Z value thresholds severely increases the Type I error rate. We found Type I error rates of above 20% after removing outliers with a threshold value of Z = 2 in a short and difficult test. Inflations of Type I error rates are particularly severe when researchers are given the freedom to alter threshold values of Z after having seen the effects thereof on outcomes. We recommend the use of nonparametric Mann-Whitney-Wilcoxon tests or robust Yuen-Welch tests without removing outliers. These alternatives to independent samples t tests are found to have nominal Type I error rates with a minimal loss of power when no outliers are present in the data and to have nominal Type I error rates and good power when outliers are present.
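A short scipy sketch (illustrative only) contrasting the criticized practice, dropping observations with |Z| above a threshold before running an independent samples t test, with the recommended Mann-Whitney-Wilcoxon test on the untouched data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# Two groups drawn from the same skewed (nonnormal) distribution: H0 is true.
a = rng.exponential(scale=1.0, size=30)
b = rng.exponential(scale=1.0, size=30)

def remove_z_outliers(x, z_max=2.0):
    """Common (problematic) practice: drop observations with |Z| above a threshold."""
    z = (x - x.mean()) / x.std(ddof=1)
    return x[np.abs(z) <= z_max]

t_removed = stats.ttest_ind(remove_z_outliers(a), remove_z_outliers(b))
mwu = stats.mannwhitneyu(a, b, alternative="two-sided")
print("t test after outlier removal:", t_removed.pvalue)
print("Mann-Whitney-Wilcoxon, no removal:", mwu.pvalue)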

Proceedings ArticleDOI
Wei Liu1, Gang Hua1, John R. Smith1
23 Jun 2014
TL;DR: The proposed one-class learning approach is considerably superior to state-of-the-art methods in eliminating outliers from a contaminated class of images, exhibiting strong robustness at outlier proportions of up to 60%.
Abstract: Outliers are pervasive in many computer vision and pattern recognition problems. Automatically eliminating outliers scattered among practical data collections is becoming increasingly important, especially for Internet-inspired vision applications. In this paper, we propose a novel one-class learning approach which is robust to contamination of the input training data and able to discover the outliers that corrupt one class of data source. Our approach works in a fully unsupervised manner, differing from traditional one-class learning supervised by known positive labels. By design, our approach optimizes a kernel-based max-margin objective which jointly learns a large-margin one-class classifier and a soft label assignment for inliers and outliers. An alternating optimization algorithm is then designed to iteratively refine the classifier and the labeling, achieving a provably convergent solution in only a few iterations. Extensive experiments conducted on four image datasets in the presence of artificial and real-world outliers demonstrate that the proposed approach is considerably superior to state-of-the-art methods in eliminating outliers from a contaminated class of images, exhibiting strong robustness at outlier proportions of up to 60%.

Journal ArticleDOI
TL;DR: Simulations and an analysis of real data demonstrate that the proposed method performs competitively on survival data sets of moderate size and high-dimensional predictors, even when these are contaminated.
Abstract: In modern statistical applications, the dimension of covariates can be much larger than the sample size. In the context of linear models, correlation screening (Fan and Lv, 2008) has been shown to reduce the dimension of such data effectively while achieving the sure screening property, i.e., all of the active variables can be retained with high probability. However, screening based on the Pearson correlation does not perform well when applied to contaminated covariates and/or censored outcomes. In this paper, we study censored rank independence screening of high-dimensional survival data. The proposed method is robust to predictors that contain outliers, works for a general class of survival models, and enjoys the sure screening property. Simulations and an analysis of real data demonstrate that the proposed method performs competitively on survival data sets of moderate size and high-dimensional predictors, even when these are contaminated.

Journal ArticleDOI
TL;DR: In this article, a new approach for indexing multigrain diffraction data is presented, based on the use of a monochromatic beam simultaneously illuminating all grains, which is suitable for online analysis during synchrotron sessions.
Abstract: A new approach for indexing multigrain diffraction data is presented. It is based on the use of a monochromatic beam simultaneously illuminating all grains. By operating in sub-volumes of Rodrigues space, a powerful vertex-finding algorithm can be applied, with a running time that is compatible with online analysis. The resulting program, GrainSpotter, is sufficiently fast to enable online analysis during synchrotron sessions. The program applies outlier rejection schemes, leading to more robust and accurate data. By simulations it is shown that several thousand grains can be retrieved. A new method to derive partial symmetries, called pseudo-twins, is introduced. Uniquely, GrainSpotter includes an analysis of pseudo-twins, which is shown to be critical to avoid erroneous grains resulting from the indexing.

Journal ArticleDOI
TL;DR: A theoretical guarantee is given for the procedure to accurately detect the communities with small misclassification rate under the setting where the number of clusters can grow with $N$, which admits to the best-known result in the literature of computationally feasible community detection in SBM without outliers.
Abstract: Community detection, which aims to cluster $N$ nodes in a given graph into $r$ distinct groups based on the observed undirected edges, is an important problem in network data analysis. In this paper, the popular stochastic block model (SBM) is extended to the generalized stochastic block model (GSBM) that allows for adversarial outlier nodes, which are connected with the other nodes in the graph in an arbitrary way. Under this model, we introduce a procedure using convex optimization followed by $k$-means algorithm with $k=r$. Both theoretical and numerical properties of the method are analyzed. A theoretical guarantee is given for the procedure to accurately detect the communities with small misclassification rate under the setting where the number of clusters can grow with $N$. This theoretical result admits to the best-known result in the literature of computationally feasible community detection in SBM without outliers. Numerical results show that our method is both computationally fast and robust to different kinds of outliers, while some popular computationally fast community detection algorithms, such as spectral clustering applied to adjacency matrices or graph Laplacians, may fail to retrieve the major clusters due to a small portion of outliers. We apply a slight modification of our method to a political blogs data set, showing that our method is competent in practice and comparable to existing computationally feasible methods in the literature. To the best of the authors' knowledge, our result is the first in the literature in terms of clustering communities with fast growing numbers under the GSBM where a portion of arbitrary outlier nodes exist.

Journal ArticleDOI
TL;DR: In this paper, the authors present M estimation, S estimation, and MM estimation in robust regression to determine a regression model; M estimation is an extension of the maximum likelihood method and is a robust estimator.
Abstract: In regression analysis, the least squares method is not appropriate for problems containing outliers or extreme observations, so we need a parameter estimation method that is robust, in the sense that the estimates are not strongly affected by small changes in the data. In this paper we present M estimation, S estimation and MM estimation in robust regression to determine a regression model. M estimation is an extension of the maximum likelihood method and is a robust estimator, while S estimation and MM estimation are developments of the M estimation method. The algorithms of these methods are presented, and we then apply them to maize production data.
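As a small illustration of M estimation in this setting (S and MM estimation are not sketched here), the following Python snippet fits a Huber-loss robust regression next to ordinary least squares on data containing a single gross outlier; all variable names are illustrative.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)
y[-1] += 20.0                       # one extreme outlier

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                                  # least squares: pulled by the outlier
huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()    # M estimation (Huber)
print("OLS coefficients:    ", ols.params)
print("Huber M coefficients:", huber.params)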

Journal ArticleDOI
TL;DR: A novel feature extraction technique for micro-Doppler classification and its real-time implementation using a support vector machine classifier on a low-cost, embedded digital signal processor are presented.
Abstract: In this paper a novel feature extraction technique for micro-Doppler classification and its real time implementation using SVM on an embedded low-cost DSP are presented. The effectiveness of the proposed technique is improved through the exploitation of the outlier rejection capabilities of the Robust PCA in place of the classic PCA.

Posted ContentDOI
18 Oct 2014-bioRxiv
TL;DR: Ancestry Composition is described, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals and achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
Abstract: Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important implications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution are typically limited to two or three ancestral populations, and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country-of-origin from the member database of the personal genetics company, 23andMe, Inc., and excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.

Journal ArticleDOI
TL;DR: A time series outlier detection method for hydrologic data that can be used to identify data that deviate from historical patterns; the method performs fast, incremental evaluation of data as they become available, scales to large quantities of data, and requires no preclassification of anomalies.
Abstract: In order to detect outliers in hydrological time series data, and thereby improve data quality and the quality of decisions related to the design, operation, and management of water resources, this research develops a time series outlier detection method for hydrologic data that can be used to identify data that deviate from historical patterns. The method first builds a forecasting model on the historical data and then uses it to predict future values. Anomalies are assumed to occur if the observed values fall outside a given prediction confidence interval (PCI), which can be calculated from the predicted value and a confidence coefficient. The use of the PCI as a threshold rests mainly on the fact that it accounts for the uncertainty in the data series parameters of the forecasting model, which addresses the problem of selecting a suitable threshold. The method performs fast, incremental evaluation of data as they become available, scales to large quantities of data, and requires no preclassification of anomalies. Experiments with different real-world hydrologic time series showed that the proposed methods are fast, correctly identify abnormal data, and can be used for hydrologic time series analysis.
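A minimal Python sketch of the PCI rule described above (a trailing-window mean stands in for whatever forecasting model is actually used; names are illustrative): an observation is flagged when it falls outside the forecast plus or minus z times the standard deviation of recent one-step forecast errors.

import numpy as np

def pci_anomalies(series, window=24, z=1.96):
    """Flag observations outside the prediction confidence interval (PCI).
    The one-step forecast is a trailing-window mean; the interval half-width
    is z times the standard deviation of recent one-step forecast errors."""
    x = np.asarray(series, dtype=float)
    flags = np.zeros(len(x), dtype=bool)
    errors = []
    for t in range(window, len(x)):
        forecast = x[t - window:t].mean()
        half_width = z * (np.std(errors) if len(errors) > 2 else x[:t].std())
        flags[t] = abs(x[t] - forecast) > half_width
        errors.append(x[t] - forecast)
    return flags

rng = np.random.default_rng(8)
flow = 100 + 10 * np.sin(np.arange(300) / 10) + rng.normal(scale=2.0, size=300)
flow[150] += 40.0                     # injected anomaly
print(np.flatnonzero(pci_anomalies(flow)))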