
Showing papers on "Outlier" published in 2016


Journal ArticleDOI
TL;DR: The robust empirical Bayes procedure described in this paper improves differential expression tests by robustifying the hyperparameter estimation, which has the double benefit of reducing the chance that hypervariable genes will be spuriously identified as DE while increasing statistical power for the main body of genes.
Abstract: One of the most common analysis tasks in genomic research is to identify genes that are differentially expressed (DE) between experimental conditions. Empirical Bayes (EB) statistical tests using moderated genewise variances have been very effective for this purpose, especially when the number of biological replicate samples is small. The EB procedures can however be heavily influenced by a small number of genes with very large or very small variances. This article improves the differential expression tests by robustifying the hyperparameter estimation procedure. The robust procedure has the effect of decreasing the informativeness of the prior distribution for outlier genes while increasing its informativeness for other genes. This effect has the double benefit of reducing the chance that hypervariable genes will be spuriously identified as DE while increasing statistical power for the main body of genes. The robust EB algorithm is fast and numerically stable. The procedure allows exact small-sample null distributions for the test statistics and reduces exactly to the original EB procedure when no outlier genes are present. Simulations show that the robustified tests have similar performance to the original tests in the absence of outlier genes but have greater power and robustness when outliers are present. The article includes case studies for which the robust method correctly identifies and downweights genes associated with hidden covariates and detects more genes likely to be scientifically relevant to the experimental conditions. The new procedure is implemented in the limma software package freely available from the Bioconductor repository.
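For readers unfamiliar with variance moderation, the heart of the EB approach is shrinking each genewise variance toward a common prior value. A minimal numpy sketch of that shrinkage follows; the median-based prior and the value d0 = 4 are crude, assumed stand-ins for the paper's robust hyperparameter estimation, which fits the prior far more carefully.

```python
# Empirical Bayes variance moderation (the "squeeze" formula of limma);
# the median-based prior is an assumed simplification that limits the
# influence of hypervariable genes on the hyperparameters.
import numpy as np

def moderated_variances(s2, d, d0=4.0, robust=True):
    """Shrink genewise sample variances s2 (each on d residual df)
    toward a prior variance s0^2 carrying d0 prior df."""
    s0_sq = np.median(s2) if robust else np.mean(s2)
    return (d0 * s0_sq + d * s2) / (d0 + d)

rng = np.random.default_rng(0)
s2 = rng.chisquare(df=4, size=10000) / 4     # typical genewise variances
s2[:20] *= 50.0                              # a few hypervariable genes
print(moderated_variances(s2, d=4)[:5])      # outliers pulled toward the prior
```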

632 citations


Journal ArticleDOI
TL;DR: An extensive experimental study of the performance of a representative set of standard k nearest neighborhood-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose, which also provides a characterization of the datasets themselves.
Abstract: The evaluation of unsupervised outlier detection algorithms is a constant challenge in data mining research. Little is known regarding the strengths and weaknesses of different standard outlier detection models, and the impact of parameter choices for these algorithms. The scarcity of appropriate benchmark datasets with ground truth annotation is a significant impediment to the evaluation of outlier methods. Even when labeled datasets are available, their suitability for the outlier detection task is typically unknown. Furthermore, the biases of commonly-used evaluation measures are not fully understood. It is thus difficult to ascertain the extent to which newly-proposed outlier detection methods improve over established methods. In this paper, we perform an extensive experimental study on the performance of a representative set of standard k nearest neighborhood-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose. Based on the overall performance of the outlier detection methods, we provide a characterization of the datasets themselves, and discuss their suitability as outlier detection benchmark sets. We also examine the most commonly-used measures for comparing the performance of different methods, and suggest adaptations that are more suitable for the evaluation of outlier detection results.
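For orientation, the simplest member of the method family evaluated in such studies scores each point by its distance to its k-th nearest neighbor. A small illustrative sketch; the choice k = 5 and the toy data are arbitrary.

```python
# Illustrative kNN outlier score: distance to the k-th nearest neighbor
# (larger = more outlying).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    # k + 1 because each training point returns itself as a 0-distance neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    return dist[:, -1]                      # distance to the k-th true neighbor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])  # one planted outlier
print(np.argmax(knn_outlier_scores(X)))                   # -> 200, the outlier
```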

552 citations


Journal ArticleDOI
TL;DR: An R package, robustlmm, is introduced for robustly fitting linear mixed-effects models; it provides estimates on which contamination has only little influence and detects and flags contamination.
Abstract: Like any real-life data, data modeled by linear mixed-effects models often contain outliers or other contamination. Even a little contamination can drive the classic estimates far away from what they would be without it. At the same time, datasets that require mixed-effects modeling are often complex and large, which makes it difficult to spot contamination. Robust estimation methods aim to solve both problems: to provide estimates on which contamination has only little influence, and to detect and flag contamination. We introduce an R package, robustlmm, to robustly fit linear mixed-effects models. The package's functions and methods are designed to closely mirror those offered by lme4, the R package that implements classic linear mixed-effects model estimation in R. The robust estimation method in robustlmm is based on the random effects contamination model and the central contamination model. Contamination can be detected at all levels of the data. The estimation method does not make any assumption about the data's grouping structure except that the model parameters are estimable. robustlmm supports hierarchical and non-hierarchical (e.g., crossed) grouping structures. The robustness of the estimates and their asymptotic efficiency is fully controlled through the function interface. Individual parts (e.g., fixed effects and variance components) can be tuned independently. In this tutorial, we show how to fit robust linear mixed-effects models using robustlmm, how to assess the model fit, how to detect outliers, and how to compare different fits.

340 citations


Journal ArticleDOI
TL;DR: Extensive experiments show the robustness of this approach under various types of distortions, such as deformation, noise, outliers, rotation, and occlusion; it greatly outperforms the state-of-the-art methods, especially when the data are badly degraded.
Abstract: In previous work on point registration, the input point sets are often represented using Gaussian mixture models and the registration is then addressed through a probabilistic approach, which aims to exploit global relationships on the point sets. For non-rigid shapes, however, the local structures among neighboring points are also strong and stable and thus helpful in recovering the point correspondence. In this paper, we formulate point registration as the estimation of a mixture of densities, where local features, such as shape context, are used to assign the membership probabilities of the mixture model. This enables us to preserve both global and local structures during matching. The transformation between the two point sets is specified in a reproducing kernel Hilbert space and a sparse approximation is adopted to achieve a fast implementation. Extensive experiments on both synthesized and real data show the robustness of our approach under various types of distortions, such as deformation, noise, outliers, rotation, and occlusion. It greatly outperforms the state-of-the-art methods, especially when the data is badly degraded.

311 citations


Proceedings ArticleDOI
09 Apr 2016
TL;DR: The system presents four key features: a big data behavioral analytics platform, an outlier detection system, a mechanism to obtain feedback from security analysts, and a supervised learning module; together these let it learn to defend against unseen attacks.
Abstract: We present AI2, an analyst-in-the-loop security system where Analyst Intuition (AI) is put together with state-of-the-art machine learning to build a complete end-to-end Artificially Intelligent solution (AI). The system presents four key features: a big data behavioral analytics platform, an outlier detection system, a mechanism to obtain feedback from security analysts, and a supervised learning module. We validate our system with a real-world data set consisting of 3.6 billion log lines and 70.2 million entities. The results show that the system is capable of learning to defend against unseen attacks. With respect to unsupervised outlier analysis, our system improves the detection rate by a factor of 2.92 and reduces false positives by more than a factor of 5.

233 citations


Journal ArticleDOI
TL;DR: Median Interannual Difference Adjusted for Skewness (MIDAS), a variant of the Theil-Sen median trend estimator, is developed; it computes a robust and realistic estimate of trend uncertainty and has the potential for broader application in the geosciences.
Abstract: Automatic estimation of velocities from GPS coordinate time series is becoming required to cope with the exponentially increasing flood of available data, but problems detectable to the human eye are often overlooked. This motivates us to find an automatic and accurate estimator of trend that is resistant to common problems such as step discontinuities, outliers, seasonality, skewness, and heteroscedasticity. Developed here, Median Interannual Difference Adjusted for Skewness (MIDAS) is a variant of the Theil-Sen median trend estimator, for which the ordinary version is the median of slopes v_ij = (x_j - x_i)/(t_j - t_i) computed between all data pairs i > j. For normally distributed data, Theil-Sen and least squares trend estimates are statistically identical, but unlike least squares, Theil-Sen is resistant to undetected data problems. To mitigate both seasonality and step discontinuities, MIDAS selects data pairs separated by 1 year. This condition is relaxed for time series with gaps so that all data are used. Slopes from data pairs spanning a step function produce one-sided outliers that can bias the median. To reduce bias, MIDAS removes outliers and recomputes the median. MIDAS also computes a robust and realistic estimate of trend uncertainty. Statistical tests using GPS data in the rigid North American plate interior show ±0.23 mm/yr root-mean-square (RMS) accuracy in horizontal velocity. In blind tests using synthetic data, MIDAS velocities have an RMS accuracy of ±0.33 mm/yr horizontal, ±1.1 mm/yr up, with a 5th percentile range smaller than all 20 automatic estimators tested. Considering its general nature, MIDAS has the potential for broader application in the geosciences.
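A rough sketch of the MIDAS recipe as described in this abstract, not the authors' reference implementation: form slopes from data pairs separated by about one year, take the median, trim outlier slopes, and re-take the median. The pair-matching tolerance and the 2x-MAD trimming threshold below are illustrative assumptions.

```python
# MIDAS-like trend: 1-year pair slopes cancel seasonality; trimming and
# re-taking the median reduces the bias from one-sided outlier slopes.
import numpy as np

def midas_like_trend(t, x, pair_dt=1.0, tol=0.01):
    t, x = np.asarray(t), np.asarray(x)
    slopes = []
    for i in range(len(t)):
        # match each epoch with one ~pair_dt years later (within tol years)
        j = np.where(np.abs(t - t[i] - pair_dt) < tol)[0]
        if j.size:
            slopes.append((x[j[0]] - x[i]) / (t[j[0]] - t[i]))
    slopes = np.array(slopes)
    med = np.median(slopes)
    mad = 1.4826 * np.median(np.abs(slopes - med))   # robust scale estimate
    kept = slopes[np.abs(slopes - med) < 2 * mad]    # remove outlier slopes
    return np.median(kept)

t = np.arange(0, 5, 1 / 52.0)                        # weekly samples, 5 years
x = 3.0 * t + 0.4 * np.sin(2 * np.pi * t) \
    + np.random.default_rng(1).normal(0, 0.2, t.size)
print(midas_like_trend(t, x))                        # ~3.0, the true trend
```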

211 citations


Journal ArticleDOI
TL;DR: In this paper, composition-based multi-graph matching methods are proposed to incorporate the two aspects by optimizing the affinity score while gradually infusing the consistency, which can serve as a regularizer in the affinity objective function.
Abstract: This paper addresses the problem of matching common node correspondences among multiple graphs referring to an identical or related structure. This multi-graph matching problem involves two correlated components: i) the local pairwise matching affinity across pairs of graphs; ii) the global matching consistency that measures the uniqueness of the pairwise matchings by different composition orders. Previous studies typically either enforce the matching consistency constraints in the beginning of an iterative optimization, which may propagate matching error both over iterations and across graph pairs; or separate affinity optimization and consistency enforcement into two steps. This paper is motivated by the observation that matching consistency can serve as a regularizer in the affinity objective function, especially when the function is biased due to noises or inappropriate modeling. We propose composition-based multi-graph matching methods to incorporate the two aspects by optimizing the affinity score while gradually infusing the consistency. We also propose two mechanisms to elicit the common inliers against outliers. Compelling results on synthetic and real images show the competency of our algorithms.

178 citations


Journal ArticleDOI
01 Aug 2016
TL;DR: This work systematically evaluates the most recent algorithms for DODDS under various stream settings and outlier rates, and shows that in most settings the MCOD algorithm offers superior performance among all the algorithms, including the most recent algorithm, Thresh_LEAP.
Abstract: Continuous outlier detection in data streams has important applications in fraud detection, network security, and public health. The arrival and departure of data objects in a streaming manner impose new challenges for outlier detection algorithms, especially in time and space efficiency. In the past decade, several studies have been performed to address the problem of distance-based outlier detection in data streams (DODDS), which adopts an unsupervised definition and does not have any distributional assumptions on data values. Our work is motivated by the lack of comparative evaluation among the state-of-the-art algorithms using the same datasets on the same platform. We systematically evaluate the most recent algorithms for DODDS under various stream settings and outlier rates. Our extensive results show that in most settings, the MCOD algorithm offers the superior performance among all the algorithms, including the most recent algorithm Thresh_LEAP.
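For orientation, the distance-based definition underlying DODDS declares an object an outlier if it has fewer than k neighbors within distance R inside the current window. The deliberately naive per-window check below illustrates the definition only; algorithms such as MCOD and Thresh_LEAP exist precisely to avoid this quadratic recomputation as the window slides. The parameters R, k, and the window size are illustrative.

```python
# Naive distance-based outlier check over a count-based sliding window.
import numpy as np
from collections import deque

def window_outliers(window, R=1.0, k=3):
    pts = np.array(window)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    neighbor_counts = (d <= R).sum(axis=1) - 1       # exclude self
    return np.where(neighbor_counts < k)[0]

stream = np.random.default_rng(0).normal(size=(1000, 2))
stream[500] = [10.0, 10.0]                           # inject an outlier
W = deque(maxlen=200)                                # sliding window
for i, p in enumerate(stream):
    W.append(p)
    if i == 520:
        print(window_outliers(W))                    # flags the injected point
```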

121 citations


Journal ArticleDOI
TL;DR: A novel notion, the Natural Outlier Factor (NOF), is proposed to measure outliers, together with an algorithm based on Natural Neighbors (NaN) that does not require any parameters to compute the NOF of the objects in the database.
Abstract: Outlier detection is an important task in data mining with numerous applications, including credit card fraud detection, video surveillance, etc. Although many outlier detection algorithms have been proposed, most of them face a serious problem: it is very difficult to select an appropriate parameter when they are run on a dataset. In this paper we use the method of Natural Neighbors to adaptively obtain this parameter, named the Natural Value. We also propose a novel notion, the Natural Outlier Factor (NOF), to measure outliers, and provide an algorithm based on Natural Neighbors (NaN) that does not require any parameters to compute the NOF of the objects in the database. Formal analysis and experiments show that this method can achieve good performance in outlier detection.

104 citations


Journal ArticleDOI
TL;DR: The proposed Generalized Logistic (GL) algorithm is simple yet effective and robust to outliers, so no additional denoising or outlier detection step is needed in data preprocessing; empirical results show that models learned from data scaled by the GL algorithm have higher accuracy than models using commonly used data scaling algorithms.
Abstract: Background: Machine learning models have been adapted in biomedical research and practice for knowledge discovery and decision support. While mainstream biomedical informatics research focuses on developing more accurate models, the importance of data preprocessing draws less attention. We propose the Generalized Logistic (GL) algorithm that scales data uniformly to an appropriate interval by learning a generalized logistic function to fit the empirical cumulative distribution function of the data. The GL algorithm is simple yet effective; it is intrinsically robust to outliers, so it is particularly suitable for diagnostic/classification models in clinical/medical applications where the number of samples is usually small; it scales the data in a nonlinear fashion, which leads to potential improvement in accuracy.
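A sketch of the scaling idea only, not the authors' exact GL algorithm: fit a generalized (Richards-type) logistic function to a feature's empirical CDF, then map raw values through the fitted curve into (0, 1). The parameterization, bounds, and initial guesses below are assumptions.

```python
# Fit a generalized logistic to the empirical CDF, then use it as a scaler.
import numpy as np
from scipy.optimize import curve_fit

def gen_logistic(x, m, s, nu):
    z = np.clip(-(x - m) / s, -500.0, 500.0)   # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(z)) ** nu

def gl_scale(x_train, x_new):
    xs = np.sort(x_train)
    ecdf = np.arange(1, xs.size + 1) / (xs.size + 1.0)   # empirical CDF
    p0 = [np.median(xs), np.std(xs), 1.0]
    params, _ = curve_fit(gen_logistic, xs, ecdf, p0=p0,
                          bounds=([-np.inf, 1e-6, 1e-6], np.inf))
    return gen_logistic(x_new, *params)

rng = np.random.default_rng(0)
x = rng.lognormal(size=500)                            # skewed feature
print(gl_scale(x, np.array([0.5, 1.0, 2.0, 1000.0])))  # extreme value -> ~1.0
```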

Journal ArticleDOI
TL;DR: A memory-efficient incremental local outlier detection algorithm for data streams (MiLOF) and a more flexible version (MiLOF_F) are proposed; both have an accuracy close to incremental LOF but within a fixed memory bound.
Abstract: Outlier detection is an important task in data mining, with applications ranging from intrusion detection to human gait analysis. With the growing need to analyze high-speed data streams, the task of outlier detection becomes even more challenging, as traditional outlier detection techniques can no longer assume that all the data can be stored for processing. While the well-known Local Outlier Factor (LOF) algorithm has an incremental version, it assumes unbounded memory to keep all previous data points. In this paper, we propose a memory-efficient incremental local outlier (MiLOF) detection algorithm for data streams and a more flexible version (MiLOF_F); both have an accuracy close to incremental LOF but within a fixed memory bound. Our experimental results show that both proposed approaches have better memory and time complexity than incremental LOF while having comparable accuracy. In addition, we show that MiLOF_F is robust to changes in the number of data points, the number of underlying clusters and the number of dimensions in the data stream. These results show that MiLOF/MiLOF_F are well suited to application environments with limited memory (e.g., wireless sensor networks), and can be applied to high-volume data streams.
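MiLOF itself is not available in common libraries, but the batch Local Outlier Factor it approximates is; the snippet below shows the LOF scores that MiLOF aims to reproduce under a fixed memory budget. The n_neighbors value and the toy data are illustrative choices.

```python
# Batch LOF as a reference point for what MiLOF approximates on streams.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),
               rng.normal(6, 1, size=(300, 2)),
               [[3.0, 10.0]]])                    # a point off both clusters

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                       # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_            # larger = more outlying
print(labels[-1], np.argmax(scores))              # expected: -1 600
```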

Journal ArticleDOI
TL;DR: This paper designs a fast distributed feature extraction and data preparation framework to extract features from raw network-wide traffic and evaluates the approach in terms of detection rate, false positive rate, precision, recall and F-measure using several high-dimensional synthetic and real-world datasets.

Journal Article
TL;DR: In particular, this paper shows that, as far as Statistical Query algorithms are concerned, the computational complexity of learning Gaussian mixture models is inherently exponential in the dimension of the latent space, even though there is no such information-theoretic barrier.
Abstract: We describe a general technique that yields the first Statistical Query lower bounds for a range of fundamental high-dimensional learning problems involving Gaussian distributions. Our main results are for the problems of (1) learning Gaussian mixture models (GMMs), and (2) robust (agnostic) learning of a single unknown Gaussian distribution. For each of these problems, we show a super-polynomial gap between the (information-theoretic) sample complexity and the computational complexity of any Statistical Query algorithm for the problem. Statistical Query (SQ) algorithms are a class of algorithms that are only allowed to query expectations of functions of the distribution rather than directly access samples. This class of algorithms is quite broad: a wide range of known algorithmic techniques in machine learning are known to be implementable using SQs. Moreover, for the unsupervised learning problems studied in this paper, all known algorithms with non-trivial performance guarantees are SQ or are easily implementable using SQs. Our SQ lower bound for Problem (1) is qualitatively matched by known learning algorithms for GMMs. At a conceptual level, this result implies that, as far as SQ algorithms are concerned, the computational complexity of learning GMMs is inherently exponential in the dimension of the latent space, even though there is no such information-theoretic barrier. Our lower bound for Problem (2) implies that the accuracy of the robust learning algorithm in [DiakonikolasKKLMS16] is essentially best possible among all polynomial-time SQ algorithms. On the positive side, we also give a new (SQ) learning algorithm for Problem (2) achieving the information-theoretically optimal accuracy, up to a constant factor, whose running time essentially matches our lower bound. Our algorithm relies on a filtering technique generalizing [DiakonikolasKKLMS16] that removes outliers based on higher-order tensors. Our SQ lower bounds are attained via a unified moment-matching technique that is useful in other contexts and may be of broader interest. Our technique yields nearly-tight lower bounds for a number of related unsupervised estimation problems. Specifically, for the problems of (3) robust covariance estimation in spectral norm, and (4) robust sparse mean estimation, we establish a quadratic statistical-computational tradeoff for SQ algorithms, matching known upper bounds. Finally, our technique can be used to obtain tight sample complexity lower bounds for high-dimensional testing problems. Specifically, for the classical problem of robustly testing an unknown mean (known covariance) Gaussian, our technique implies an information-theoretic sample lower bound that scales linearly in the dimension. Our sample lower bound matches the sample complexity of the corresponding robust learning problem and separates the sample complexity of robust testing from standard (non-robust) testing. This separation is surprising because such a gap does not exist for the corresponding learning problem.

Journal ArticleDOI
TL;DR: Three methods for the identification of multivariate outliers are compared; all are based on the Mahalanobis distance, made resistant against outliers and model deviations by robust estimation of location and covariance.
Abstract: Three methods for the identification of multivariate outliers (Rousseeuw and Van Zomeren, 1990; Becker and Gather, 1999; Filzmoser et al., 2005) are compared. They are based on the Mahalanobis distance that will be made resistant against outliers and model deviations by robust estimation of location and covariance. The comparison is made by means of a simulation study. Not only the case of multivariate normally distributed data, but also heavy-tailed and asymmetric distributions are considered. The simulations are focused on low-dimensional (p = 5) and high-dimensional (p = 30) data.
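A standard construction behind such comparisons is the robust Mahalanobis distance computed from a Minimum Covariance Determinant (MCD) fit, with outliers flagged against a chi-square cutoff. A sketch using scikit-learn's MCD estimator; the 97.5% quantile cutoff is a common convention, not something mandated by these papers.

```python
# Robust Mahalanobis distances from an MCD fit, flagged by a chi2 cutoff.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
p = 5
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=500)
X[:10] += 8.0                                  # contaminate 2% of the rows

mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)                        # squared robust distances
outliers = d2 > chi2.ppf(0.975, df=p)
print(outliers[:10].all(), outliers[10:].mean())   # True, small FP rate
```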

Journal ArticleDOI
Ke Nian, Haofan Zhang, Aditya Tayal, Thomas F. Coleman, Yuying Li
TL;DR: It is illustrated that the spectral optimization in SRA can be viewed as a relaxation of an unsupervised SVM problem, and it is demonstrated that the first non-principal eigenvector of a Laplacian matrix is linked to a bi-class classification strength measure which can be used to rank anomalies.

Proceedings Article
Bo Xin, Yizhou Wang, Wen Gao, David Wipf, Baoyuan Wang
01 Apr 2016
TL;DR: In this paper, a deep network is used to learn iterative sparse estimation algorithms and is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.
Abstract: The iterations of many sparse estimation algorithms are comprised of a fixed linear filter cascaded with a thresholding nonlinearity, which collectively resemble a typical neural network layer. Consequently, a lengthy sequence of algorithm iterations can be viewed as a deep network with shared, hand-crafted layer weights. It is therefore quite natural to examine the degree to which a learned network model might act as a viable surrogate for traditional sparse estimation in domains where ample training data is available. While the possibility of a reduced computational budget is readily apparent when a ceiling is imposed on the number of layers, our work primarily focuses on estimation accuracy. In particular, it is well-known that when a signal dictionary has coherent columns, as quantified by a large RIP constant, then most tractable iterative algorithms are unable to find maximally sparse representations. In contrast, we demonstrate both theoretically and empirically the potential for a trained deep network to recover minimal ℓ0-norm representations in regimes where existing methods fail. The resulting system, which can effectively learn novel iterative sparse estimation algorithms, is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.
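The abstract's central observation in miniature: one ISTA iteration for sparse coding is a fixed linear map followed by a soft-threshold nonlinearity, exactly the shape of a network layer. Learned variants (e.g. LISTA) train the matrices W and S below instead of deriving them from the dictionary; the dictionary, sparsity level, and step counts here are illustrative.

```python
# ISTA for sparse coding: each iteration is one hand-crafted "layer".
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(D, y, lam=0.1, n_iter=500):
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of gradient
    W = D.T / L                                # fixed "input filter" per layer
    S = np.eye(D.shape[1]) - D.T @ D / L       # fixed recurrence matrix
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):                    # each pass = one "layer"
        x = soft_threshold(S @ x + W @ y, lam / L)
    return x

rng = np.random.default_rng(0)
D = rng.normal(size=(30, 60))
D /= np.linalg.norm(D, axis=0)                 # unit-norm dictionary columns
x_true = np.zeros(60)
x_true[[3, 17, 42]] = [1.0, -2.0, 1.5]
print(np.nonzero(np.round(ista(D, D @ x_true), 1))[0])   # ideally [3, 17, 42]
```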

Journal ArticleDOI
TL;DR: In this paper, a mixture of multivariate contaminated normal distributions is developed for model-based clustering, where each cluster has a parameter controlling the proportion of mild outliers and one specifying the degree of contamination.
Abstract: A mixture of multivariate contaminated normal distributions is developed for model-based clustering. In addition to the parameters of the classical normal mixture, our contaminated mixture has, for each cluster, a parameter controlling the proportion of mild outliers and one specifying the degree of contamination. Crucially, these parameters do not have to be specified a priori, adding a flexibility to our approach. Parsimony is introduced via eigen-decomposition of the component covariance matrices, and sufficient conditions for the identifiability of all the members of the resulting family are provided. An expectation-conditional maximization algorithm is outlined for parameter estimation and various implementation issues are discussed. Using a large-scale simulation study, the behavior of the proposed approach is investigated and comparison with well-established finite mixtures is provided. The performance of this novel family of models is also illustrated on artificial and real data.
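Concretely, the per-cluster density in such a contaminated mixture is a two-component normal mixture sharing one mean, with the second component's covariance inflated by a factor eta > 1; with probability 1 - alpha a point is "good", with probability alpha it is a mild outlier. A small evaluation sketch, with alpha and eta chosen arbitrarily for illustration.

```python
# Density of a single contaminated normal cluster:
# (1 - alpha) * N(mu, Sigma) + alpha * N(mu, eta * Sigma), eta > 1.
import numpy as np
from scipy.stats import multivariate_normal

def contaminated_normal_pdf(x, mu, Sigma, alpha=0.05, eta=9.0):
    good = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
    bad = multivariate_normal.pdf(x, mean=mu, cov=eta * Sigma)
    return (1.0 - alpha) * good + alpha * bad

mu, Sigma = np.zeros(2), np.eye(2)
for pt in ([0.0, 0.0], [4.0, 4.0]):
    # the inflated component keeps far points' density non-negligible
    print(pt, contaminated_normal_pdf(np.array(pt), mu, Sigma))
```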

Journal ArticleDOI
TL;DR: Extremal depth (ED) as discussed by the authors is a new notion for functional data, which is based on a measure of extreme "outlyingness" and is especially suited for obtaining central regions of functional data and function spaces.
Abstract: We propose a new notion called “extremal depth” (ED) for functional data, discuss its properties, and compare its performance with existing concepts. The proposed notion is based on a measure of extreme “outlyingness.” ED has several desirable properties that are not shared by other notions and is especially well suited for obtaining central regions of functional data and function spaces. In particular: (a) the central region achieves the nominal (desired) simultaneous coverage probability; (b) there is a correspondence between ED-based (simultaneous) central regions and appropriate pointwise central regions; and (c) the method is resistant to certain classes of functional outliers. The article examines the performance of ED and compares it with other depth notions. Its usefulness is demonstrated through applications to constructing central regions, functional boxplots, outlier detection, and simultaneous confidence bands in regression problems. Supplementary materials for this article are available online.

Journal ArticleDOI
TL;DR: New algorithms for continuous outlier monitoring in data streams, based on sliding windows, are proposed; they reduce the required storage overhead, are more efficient than previously proposed techniques, and offer significant flexibility with regard to the input parameters.

Journal ArticleDOI
TL;DR: An approach to state estimation for discrete-time linear time-invariant systems with measurements that may be affected by outliers is presented; it uses only a batch of the most recent inputs and outputs according to a moving-horizon strategy.

Journal ArticleDOI
TL;DR: An estimation method is developed based on the so-called “R^(α)-posterior density”; this construction uses the concept of priors in the Bayesian context and generates highly robust estimators with good efficiency under the true model.
Abstract: The ordinary Bayes estimator based on the posterior density can have potential problems with outliers. Using the density power divergence measure, we develop an estimation method in this paper based on the so-called “R^(α)-posterior density”; this construction uses the concept of priors in the Bayesian context and generates highly robust estimators with good efficiency under the true model. We develop the asymptotic properties of the proposed estimator and illustrate its performance numerically.
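As background, the density power divergence (DPD) of Basu et al. (1998), on which this line of work builds, replaces the log-likelihood with an objective that smoothly downweights outliers. A sketch for the robust mean of a normal with known scale; the closed-form integral below is specific to this toy case, and the tuning value a = 0.5 is arbitrary.

```python
# DPD objective for density f_theta and data x_1..x_n (tuning a > 0):
#   integral f_theta^(1+a) dx - (1 + 1/a) * mean(f_theta(x_i)^a);
# as a -> 0 this recovers maximum likelihood.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def dpd_loss(mu, x, a=0.5):
    # For N(mu, 1), the integral of f^(1+a) has this closed form.
    integral = (2 * np.pi) ** (-a / 2) / np.sqrt(1 + a)
    return integral - (1 + 1 / a) * np.mean(norm.pdf(x, mu, 1.0) ** a)

rng = np.random.default_rng(0)
x = np.append(rng.normal(0, 1, 200), [15.0] * 20)    # ~10% gross outliers
mle = x.mean()                                       # dragged toward 15
dpd = minimize_scalar(dpd_loss, args=(x,), bounds=(-5, 5), method='bounded').x
print(round(mle, 2), round(dpd, 2))                  # DPD stays near 0
```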

Journal ArticleDOI
TL;DR: Methods to detect outliers in network flow measurements that may be due to pipe bursts or unusual consumptions are fundamental to improve water distribution system on-line operation and management, and to ensure reliable historical data for sustainable planning and design of these systems.
Abstract: Methods to detect outliers in network flow measurements that may be due to pipe bursts or unusual consumptions are fundamental to improve water distribution system on-line operation and management, and to ensure reliable historical data for sustainable planning and design of these systems. To detect and classify anomalous events in flow data from district metering areas a four-step methodology was adopted, implemented and tested: i) data acquisition, ii) data validation and normalization, iii) anomalous observation detection, iv) anomalous event detection and characterization. This approach is based on the renewed concept of outlier regions and depends on a reduced number of configuration parameters: the number of past observations, the true positive rate and the false positive rate. Results indicate that this approach is flexible and applicable to the detection of different types of events (e.g., pipe burst, unusual consumption) and to different flow time series (e.g., instantaneous, minimum night flow).

Journal ArticleDOI
TL;DR: In this paper, the authors define a number of outlier detection algorithms related to the Huber-skip and least trimmed squares estimators, including the one-step Huberskip estimator and the forward search.
Abstract: Outlier detection algorithms are intimately connected with robust statistics that down-weight some observations to zero. We define a number of outlier detection algorithms related to the Huber-skip and least trimmed squares estimators, including the one-step Huber-skip estimator and the forward search. Next, we review a recently developed asymptotic theory of these. Finally, we analyse the gauge, the fraction of wrongly detected outliers, for a number of outlier detection algorithms and establish an asymptotic normal and a Poisson theory for the gauge.

Journal ArticleDOI
TL;DR: The method combining the Bonferroni–Holm test to judge each residual and the residual standardization strategy of PlabStat exhibited good ability to detect outliers in small and large datasets and under a genomic prediction application.
Abstract: We review and propose several methods for identifying possible outliers and evaluate their properties. The methods are applied to a genomic prediction program in hybrid rye. Many plant breeders use ANOVA-based software for routine analysis of field trials. These programs may offer specific in-built options for residual analysis that are lacking in current REML software. With the advance of molecular technologies, there is a need to switch to REML-based approaches, but without losing the good features of outlier detection methods that have proven useful in the past. Our aims were to compare the variance component estimates between ANOVA and REML approaches, to scrutinize the outlier detection method of the ANOVA-based package PlabStat and to propose and evaluate alternative procedures for outlier detection. We compared the outputs produced using ANOVA and REML approaches of four published datasets of generalized lattice designs. Five outlier detection methods are explained step by step. Their performance was evaluated by measuring the true positive rate and the false positive rate in a dataset with artificial outliers simulated in several scenarios. An implementation of genomic prediction using an empirical rye multi-environment trial was used to assess the outlier detection methods with respect to the predictive abilities of a mixed model for each method. We provide a detailed explanation of how the PlabStat outlier detection methodology can be translated to REML-based software together with the evaluation of alternative methods to identify outliers. The method combining the Bonferroni–Holm test to judge each residual and the residual standardization strategy of PlabStat exhibited good ability to detect outliers in small and large datasets and under a genomic prediction application. We recommend the use of outlier detection methods as a decision support in the routine data analyses of plant breeding experiments.
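A sketch of the flagged-residual idea named above: standardize residuals, compute two-sided normal p-values, and apply the Holm step-down correction, flagging any residual that survives. The median/MAD standardization used here is a robust stand-in, not PlabStat's exact strategy.

```python
# Bonferroni-Holm flagging of standardized residuals.
import numpy as np
from scipy.stats import norm

def holm_flags(resid, alpha=0.05):
    med = np.median(resid)
    z = (resid - med) / (1.4826 * np.median(np.abs(resid - med)))
    p = 2 * norm.sf(np.abs(z))                  # two-sided p-values
    order = np.argsort(p)
    flags = np.zeros(resid.size, dtype=bool)
    for rank, i in enumerate(order):            # Holm: compare to alpha/(m-rank)
        if p[i] <= alpha / (p.size - rank):
            flags[i] = True
        else:
            break                               # step-down stops at first failure
    return flags

r = np.random.default_rng(0).normal(size=100)
r[7] = 6.0                                      # one gross residual
print(np.where(holm_flags(r))[0])               # -> [7]
```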

Journal ArticleDOI
TL;DR: Theoretical results are illustrated by simulations, which show a significant increase in the accuracy of the OE model parameter estimates when the robust identification procedure is used instead of the linear identification algorithm for OE models.
Abstract: This paper considers a robust algorithm for identification of OE (output error) models with constrained output in the presence of non-Gaussian noises. In practical conditions, measurements contain rare observations that are inconsistent with the largest part of the population of observations (outliers). The synthesis of robust algorithms is based on Huber's theory of robust statistics. It is also a known fact that constraints play a very important role in many practical cases. If constraints are not taken into consideration, the control performance can degrade and the safety of a process may be at risk. The practical value of the proposed robust algorithm for estimation of OE model parameters with constrained output variance is further increased by using an optimal input design. It is shown that the optimal input can be obtained by a minimum variance controller whose reference is a white noise sequence with known variance. A key problem is that the optimal input depends on the system parameters to be identified. In order to be able to implement the proposed optimal input, an adaptive two-stage procedure for generating the input signal is proposed. Theoretical results are illustrated by simulations, which show a significant increase in the accuracy of the OE model parameter estimates when the robust identification procedure is used in place of the linear identification algorithm for OE models. It can also be seen that the convergence rate of the robust algorithm is further increased by using the optimal input design, which increases the practical value of the proposed robust procedure.
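This is not the paper's OE identification scheme, but the Huber robustification it builds on can be illustrated by M-estimation of a linear regression via iteratively reweighted least squares, where large residuals receive bounded weight. The tuning constant k = 1.345 is the usual choice for 95% efficiency under Gaussian noise; the model and data are illustrative.

```python
# Huber M-estimation by iteratively reweighted least squares (IRLS).
import numpy as np

def huber_irls(Phi, y, k=1.345, n_iter=50):
    theta = np.linalg.lstsq(Phi, y, rcond=None)[0]     # start from OLS
    for _ in range(n_iter):
        r = y - Phi @ theta
        s = 1.4826 * np.median(np.abs(r)) + 1e-12      # robust scale
        w = np.minimum(1.0, k / (np.abs(r / s) + 1e-12))  # Huber weights
        sw = np.sqrt(w)
        theta = np.linalg.lstsq(sw[:, None] * Phi, sw * y, rcond=None)[0]
    return theta

rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 2))
y = Phi @ np.array([1.0, -2.0]) + rng.normal(0, 0.1, 200)
y[::20] += 10.0                                        # impulsive outliers
print(huber_irls(Phi, y))                              # ~ [1, -2]
```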

Journal ArticleDOI
TL;DR: This paper proposes a Grid-Based Partition algorithm (GBP) as a data preparation method, and a Distributed LOF Computing method (DLC) for detecting density-based outliers in parallel that needs only a small amount of network communication.

Journal ArticleDOI
TL;DR: This paper proposes an online tracking algorithm based on a novel robust linear regression estimator that models the error term with the Gaussian-Laplacian distribution, which can be efficiently solved and provides insights on the relationships among the LSS problem, Huber loss function, and trivial templates.
Abstract: In this paper, we propose an online tracking algorithm based on a novel robust linear regression estimator. In contrast to existing methods, the proposed least soft-threshold squares (LSS) algorithm models the error term with the Gaussian-Laplacian distribution, which can be efficiently solved. For visual tracking, the Gaussian-Laplacian noise assumption enables our LSS model to handle normal appearance change and outliers simultaneously. Based on the maximum joint likelihood of parameters, we derive an LSS distance metric to measure the difference between an observation sample and a dictionary of positive templates. Compared with the distance derived from ordinary least squares methods, the proposed metric is more effective in dealing with outliers. In addition, we provide insights on the relationships among the LSS problem, the Huber loss function, and trivial templates, which facilitate better understanding of existing tracking methods. Finally, we develop a robust tracking algorithm based on the LSS distance metric with an update scheme and negative templates, and speed it up with a particle selection mechanism. Experimental results on numerous challenging image sequences demonstrate that the proposed tracking algorithm performs favorably against the state-of-the-art methods.

Proceedings ArticleDOI
01 May 2016
TL;DR: This paper proposes a collapsed Gibbs Sampling algorithm for the Dirichlet Process Multinomial Mixture model for text clustering (abbr. to GSDPMM) which does not need to specify the number of clusters in advance and can cope with the high-dimensional problem oftext clustering.
Abstract: Text clustering is a challenging problem due to the high-dimensional and large-volume characteristics of text datasets. In this paper, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Process Multinomial Mixture model for text clustering (abbr. to GSDPMM) which does not need to specify the number of clusters in advance and can cope with the high-dimensional problem of text clustering. Our extensive experimental study shows that GSDPMM can achieve significantly better performance than three other clustering methods and can achieve high consistency on both long and short text datasets. We found that GSDPMM has low time and space complexity and can scale well with huge text datasets. We also propose some novel and effective methods to detect the outliers in the dataset and obtain the representative words of each cluster.

Journal ArticleDOI
TL;DR: A novel technique called FEMI, which imputes numerical and categorical missing values by making an educated guess based on records that are similar to the record having a missing value, and applies a fuzzy clustering approach and the authors' novel fuzzy expectation maximization algorithm.
Abstract: Data preprocessing and cleansing play a vital role in data mining by ensuring good quality of data. Data-cleansing tasks include imputation of missing values, identification of outliers, and identification and correction of noisy data. In this paper, we present a novel technique called A Fuzzy Expectation Maximization and Fuzzy Clustering-based Missing Value Imputation Framework for Data Pre-processing (FEMI). It imputes numerical and categorical missing values by making an educated guess based on records that are similar to the record having a missing value. While identifying a group of similar records and making a guess based on the group, it applies a fuzzy clustering approach and our novel fuzzy expectation maximization algorithm. We evaluate FEMI on eight publicly available natural data sets by comparing its performance with the performance of five high-quality existing techniques, namely EMI, GkNN, FKMI, SVR and IBLLS. We use thirty-two types (patterns) of missing values for each data set. Two evaluation criteria, namely root mean squared error and mean absolute error, are used. Our experimental results indicate (according to a confidence interval and t-test analysis) that FEMI performs significantly better than EMI, GkNN, FKMI, SVR, and IBLLS.