
Showing papers on "Outlier" published in 2016


Journal ArticleDOI
TL;DR: The robust empirical Bayes procedure described in this paper improves differential expression tests by robustifying the hyperparameter estimation, which has the double benefit of reducing the chance that hypervariable genes will be spuriously identified as DE while increasing statistical power for the main body of genes.
Abstract: One of the most common analysis tasks in genomic research is to identify genes that are differentially expressed (DE) between experimental conditions. Empirical Bayes (EB) statistical tests using moderated genewise variances have been very effective for this purpose, especially when the number of biological replicate samples is small. The EB procedures can however be heavily influenced by a small number of genes with very large or very small variances. This article improves the differential expression tests by robustifying the hyperparameter estimation procedure. The robust procedure has the effect of decreasing the informativeness of the prior distribution for outlier genes while increasing its informativeness for other genes. This effect has the double benefit of reducing the chance that hypervariable genes will be spuriously identified as DE while increasing statistical power for the main body of genes. The robust EB algorithm is fast and numerically stable. The procedure allows exact small-sample null distributions for the test statistics and reduces exactly to the original EB procedure when no outlier genes are present. Simulations show that the robustified tests have similar performance to the original tests in the absence of outlier genes but have greater power and robustness when outliers are present. The article includes case studies for which the robust method correctly identifies and downweights genes associated with hidden covariates and detects more genes likely to be scientifically relevant to the experimental conditions. The new procedure is implemented in the limma software package freely available from the Bioconductor repository.
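For readers unfamiliar with variance moderation, the heart of the EB approach is shrinking each genewise variance toward a common prior value. A minimal numpy sketch of that shrinkage follows; the median-based prior and the value d0 = 4 are crude, assumed stand-ins for the paper's robust hyperparameter estimation, which fits the prior far more carefully.

```python
# Empirical Bayes variance moderation (the "squeeze" formula of limma);
# the median-based prior is an assumed simplification that limits the
# influence of hypervariable genes on the hyperparameters.
import numpy as np

def moderated_variances(s2, d, d0=4.0, robust=True):
    """Shrink genewise sample variances s2 (each on d residual df)
    toward a prior variance s0^2 carrying d0 prior df."""
    s0_sq = np.median(s2) if robust else np.mean(s2)
    return (d0 * s0_sq + d * s2) / (d0 + d)

rng = np.random.default_rng(0)
s2 = rng.chisquare(df=4, size=10000) / 4     # typical genewise variances
s2[:20] *= 50.0                              # a few hypervariable genes
print(moderated_variances(s2, d=4)[:5])      # outliers pulled toward the prior
```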

632 citations


Journal ArticleDOI
TL;DR: An extensive experimental study of the performance of a representative set of standard k nearest neighborhood-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose, which also provides a characterization of the datasets themselves.
Abstract: The evaluation of unsupervised outlier detection algorithms is a constant challenge in data mining research. Little is known regarding the strengths and weaknesses of different standard outlier detection models, and the impact of parameter choices for these algorithms. The scarcity of appropriate benchmark datasets with ground truth annotation is a significant impediment to the evaluation of outlier methods. Even when labeled datasets are available, their suitability for the outlier detection task is typically unknown. Furthermore, the biases of commonly-used evaluation measures are not fully understood. It is thus difficult to ascertain the extent to which newly-proposed outlier detection methods improve over established methods. In this paper, we perform an extensive experimental study on the performance of a representative set of standard k nearest neighborhood-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose. Based on the overall performance of the outlier detection methods, we provide a characterization of the datasets themselves, and discuss their suitability as outlier detection benchmark sets. We also examine the most commonly-used measures for comparing the performance of different methods, and suggest adaptations that are more suitable for the evaluation of outlier detection results.
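For orientation, the simplest member of the method family evaluated in such studies scores each point by its distance to its k-th nearest neighbor. A small illustrative sketch; the choice k = 5 and the toy data are arbitrary.

```python
# Illustrative kNN outlier score: distance to the k-th nearest neighbor
# (larger = more outlying).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    # k + 1 because each training point returns itself as a 0-distance neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    return dist[:, -1]                      # distance to the k-th true neighbor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])  # one planted outlier
print(np.argmax(knn_outlier_scores(X)))                   # -> 200, the outlier
```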

552 citations


Journal ArticleDOI
TL;DR: An R package, robustlmm, is introduced for robustly fitting linear mixed-effects models; it provides estimates on which contamination has only little influence and detects and flags contamination.
Abstract: Like any real-life data, data modeled by linear mixed-effects models often contain outliers or other contamination. Even a little contamination can drive the classic estimates far away from what they would be without it. At the same time, datasets that require mixed-effects modeling are often complex and large, which makes it difficult to spot contamination. Robust estimation methods aim to solve both problems: to provide estimates on which contamination has only little influence, and to detect and flag contamination. We introduce an R package, robustlmm, to robustly fit linear mixed-effects models. The package's functions and methods are designed to closely mirror those offered by lme4, the R package that implements classic linear mixed-effects model estimation in R. The robust estimation method in robustlmm is based on the random effects contamination model and the central contamination model. Contamination can be detected at all levels of the data. The estimation method does not make any assumption about the data's grouping structure except that the model parameters are estimable. robustlmm supports hierarchical and non-hierarchical (e.g., crossed) grouping structures. The robustness of the estimates and their asymptotic efficiency is fully controlled through the function interface. Individual parts (e.g., fixed effects and variance components) can be tuned independently. In this tutorial, we show how to fit robust linear mixed-effects models using robustlmm, how to assess the model fit, how to detect outliers, and how to compare different fits.

340 citations


Journal ArticleDOI
TL;DR: Extensive experiments show the robustness of this approach under various types of distortions, such as deformation, noise, outliers, rotation, and occlusion; it greatly outperforms the state-of-the-art methods, especially when the data are badly degraded.
Abstract: In previous work on point registration, the input point sets are often represented using Gaussian mixture models and the registration is then addressed through a probabilistic approach, which aims to exploit global relationships on the point sets. For non-rigid shapes, however, the local structures among neighboring points are also strong and stable and thus helpful in recovering the point correspondence. In this paper, we formulate point registration as the estimation of a mixture of densities, where local features, such as shape context, are used to assign the membership probabilities of the mixture model. This enables us to preserve both global and local structures during matching. The transformation between the two point sets is specified in a reproducing kernel Hilbert space and a sparse approximation is adopted to achieve a fast implementation. Extensive experiments on both synthesized and real data show the robustness of our approach under various types of distortions, such as deformation, noise, outliers, rotation, and occlusion. It greatly outperforms the state-of-the-art methods, especially when the data is badly degraded.

311 citations


Proceedings ArticleDOI
09 Apr 2016
TL;DR: The system presents four key features: a big data behavioral analytics platform, an outlier detection system, a mechanism to obtain feedback from security analysts, and a supervised learning module; together these let it learn to defend against unseen attacks.
Abstract: We present AI2, an analyst-in-the-loop security system where Analyst Intuition (AI) is put together with state-of-the-art machine learning to build a complete end-to-end Artificially Intelligent solution (AI). The system presents four key features: a big data behavioral analytics platform, an outlier detection system, a mechanism to obtain feedback from security analysts, and a supervised learning module. We validate our system with a real-world data set consisting of 3.6 billion log lines and 70.2 million entities. The results show that the system is capable of learning to defend against unseen attacks. With respect to unsupervised outlier analysis, our system improves the detection rate by a factor of 2.92 and reduces false positives by more than a factor of 5.

233 citations


Journal ArticleDOI
TL;DR: Median Interannual Difference Adjusted for Skewness (MIDAS), a variant of the Theil-Sen median trend estimator, is developed; it computes a robust and realistic estimate of trend uncertainty and has the potential for broader application in the geosciences.
Abstract: Automatic estimation of velocities from GPS coordinate time series is becoming required to cope with the exponentially increasing flood of available data, but problems detectable to the human eye are often overlooked. This motivates us to find an automatic and accurate estimator of trend that is resistant to common problems such as step discontinuities, outliers, seasonality, skewness, and heteroscedasticity. Developed here, Median Interannual Difference Adjusted for Skewness (MIDAS) is a variant of the Theil-Sen median trend estimator, for which the ordinary version is the median of slopes v_ij = (x_j - x_i)/(t_j - t_i) computed between all data pairs i > j. For normally distributed data, Theil-Sen and least squares trend estimates are statistically identical, but unlike least squares, Theil-Sen is resistant to undetected data problems. To mitigate both seasonality and step discontinuities, MIDAS selects data pairs separated by 1 year. This condition is relaxed for time series with gaps so that all data are used. Slopes from data pairs spanning a step function produce one-sided outliers that can bias the median. To reduce bias, MIDAS removes outliers and recomputes the median. MIDAS also computes a robust and realistic estimate of trend uncertainty. Statistical tests using GPS data in the rigid North American plate interior show ±0.23 mm/yr root-mean-square (RMS) accuracy in horizontal velocity. In blind tests using synthetic data, MIDAS velocities have an RMS accuracy of ±0.33 mm/yr horizontal, ±1.1 mm/yr up, with a 5th percentile range smaller than all 20 automatic estimators tested. Considering its general nature, MIDAS has the potential for broader application in the geosciences.
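A rough sketch of the MIDAS recipe as described in this abstract, not the authors' reference implementation: form slopes from data pairs separated by about one year, take the median, trim outlier slopes, and re-take the median. The pair-matching tolerance and the 2x-MAD trimming threshold below are illustrative assumptions.

```python
# MIDAS-like trend: 1-year pair slopes cancel seasonality; trimming and
# re-taking the median reduces the bias from one-sided outlier slopes.
import numpy as np

def midas_like_trend(t, x, pair_dt=1.0, tol=0.01):
    t, x = np.asarray(t), np.asarray(x)
    slopes = []
    for i in range(len(t)):
        # match each epoch with one ~pair_dt years later (within tol years)
        j = np.where(np.abs(t - t[i] - pair_dt) < tol)[0]
        if j.size:
            slopes.append((x[j[0]] - x[i]) / (t[j[0]] - t[i]))
    slopes = np.array(slopes)
    med = np.median(slopes)
    mad = 1.4826 * np.median(np.abs(slopes - med))   # robust scale estimate
    kept = slopes[np.abs(slopes - med) < 2 * mad]    # remove outlier slopes
    return np.median(kept)

t = np.arange(0, 5, 1 / 52.0)                        # weekly samples, 5 years
x = 3.0 * t + 0.4 * np.sin(2 * np.pi * t) \
    + np.random.default_rng(1).normal(0, 0.2, t.size)
print(midas_like_trend(t, x))                        # ~3.0, the true trend
```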

211 citations


Journal ArticleDOI
TL;DR: In this paper, composition-based multi-graph matching methods are proposed to incorporate the two aspects by optimizing the affinity score while gradually infusing the consistency, which can serve as a regularizer in the affinity objective function.
Abstract: This paper addresses the problem of matching common node correspondences among multiple graphs referring to an identical or related structure. This multi-graph matching problem involves two correlated components: i) the local pairwise matching affinity across pairs of graphs; ii) the global matching consistency that measures the uniqueness of the pairwise matchings by different composition orders. Previous studies typically either enforce the matching consistency constraints in the beginning of an iterative optimization, which may propagate matching error both over iterations and across graph pairs; or separate affinity optimization and consistency enforcement into two steps. This paper is motivated by the observation that matching consistency can serve as a regularizer in the affinity objective function, especially when the function is biased due to noises or inappropriate modeling. We propose composition-based multi-graph matching methods to incorporate the two aspects by optimizing the affinity score while gradually infusing the consistency. We also propose two mechanisms to elicit the common inliers against outliers. Compelling results on synthetic and real images show the competency of our algorithms.

178 citations


Journal ArticleDOI
01 Aug 2016
TL;DR: This work systematically evaluates the most recent algorithms for DODDS under various stream settings and outlier rates, and shows that in most settings the MCOD algorithm offers superior performance among all the algorithms, including the most recent algorithm, Thresh_LEAP.
Abstract: Continuous outlier detection in data streams has important applications in fraud detection, network security, and public health. The arrival and departure of data objects in a streaming manner impose new challenges for outlier detection algorithms, especially in time and space efficiency. In the past decade, several studies have been performed to address the problem of distance-based outlier detection in data streams (DODDS), which adopts an unsupervised definition and does not have any distributional assumptions on data values. Our work is motivated by the lack of comparative evaluation among the state-of-the-art algorithms using the same datasets on the same platform. We systematically evaluate the most recent algorithms for DODDS under various stream settings and outlier rates. Our extensive results show that in most settings, the MCOD algorithm offers the superior performance among all the algorithms, including the most recent algorithm Thresh_LEAP.
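For orientation, the distance-based definition underlying DODDS declares an object an outlier if it has fewer than k neighbors within distance R inside the current window. The deliberately naive per-window check below illustrates the definition only; algorithms such as MCOD and Thresh_LEAP exist precisely to avoid this quadratic recomputation as the window slides. The parameters R, k, and the window size are illustrative.

```python
# Naive distance-based outlier check over a count-based sliding window.
import numpy as np
from collections import deque

def window_outliers(window, R=1.0, k=3):
    pts = np.array(window)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    neighbor_counts = (d <= R).sum(axis=1) - 1       # exclude self
    return np.where(neighbor_counts < k)[0]

stream = np.random.default_rng(0).normal(size=(1000, 2))
stream[500] = [10.0, 10.0]                           # inject an outlier
W = deque(maxlen=200)                                # sliding window
for i, p in enumerate(stream):
    W.append(p)
    if i == 520:
        print(window_outliers(W))                    # flags the injected point
```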

121 citations


Journal ArticleDOI
TL;DR: A novel notion, the Natural Outlier Factor (NOF), is proposed to measure outliers, together with an algorithm based on Natural Neighbors (NaN) that does not require any parameters to compute the NOF of the objects in the database.
Abstract: Outlier detection is an important task in data mining with numerous applications, including credit card fraud detection, video surveillance, etc. Although many outlier detection algorithms have been proposed, most of them face a serious problem: it is very difficult to select an appropriate parameter when they are run on a dataset. In this paper we use the method of Natural Neighbors to adaptively obtain this parameter, named the Natural Value. We also propose a novel notion, the Natural Outlier Factor (NOF), to measure outliers, and provide an algorithm based on Natural Neighbors (NaN) that does not require any parameters to compute the NOF of the objects in the database. Formal analysis and experiments show that this method can achieve good performance in outlier detection.

104 citations


Journal ArticleDOI
TL;DR: The proposed Generalized Logistic (GL) algorithm is simple yet effective and robust to outliers, so no additional denoising or outlier detection step is needed in data preprocessing; empirical results show that models learned from data scaled by the GL algorithm have higher accuracy than models using commonly used data scaling algorithms.
Abstract: Background: Machine learning models have been adapted in biomedical research and practice for knowledge discovery and decision support. While mainstream biomedical informatics research focuses on developing more accurate models, the importance of data preprocessing draws less attention. We propose the Generalized Logistic (GL) algorithm that scales data uniformly to an appropriate interval by learning a generalized logistic function to fit the empirical cumulative distribution function of the data. The GL algorithm is simple yet effective; it is intrinsically robust to outliers, so it is particularly suitable for diagnostic/classification models in clinical/medical applications where the number of samples is usually small; it scales the data in a nonlinear fashion, which leads to potential improvement in accuracy.
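A sketch of the scaling idea only, not the authors' exact GL algorithm: fit a generalized (Richards-type) logistic function to a feature's empirical CDF, then map raw values through the fitted curve into (0, 1). The parameterization, bounds, and initial guesses below are assumptions.

```python
# Fit a generalized logistic to the empirical CDF, then use it as a scaler.
import numpy as np
from scipy.optimize import curve_fit

def gen_logistic(x, m, s, nu):
    z = np.clip(-(x - m) / s, -500.0, 500.0)   # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(z)) ** nu

def gl_scale(x_train, x_new):
    xs = np.sort(x_train)
    ecdf = np.arange(1, xs.size + 1) / (xs.size + 1.0)   # empirical CDF
    p0 = [np.median(xs), np.std(xs), 1.0]
    params, _ = curve_fit(gen_logistic, xs, ecdf, p0=p0,
                          bounds=([-np.inf, 1e-6, 1e-6], np.inf))
    return gen_logistic(x_new, *params)

rng = np.random.default_rng(0)
x = rng.lognormal(size=500)                            # skewed feature
print(gl_scale(x, np.array([0.5, 1.0, 2.0, 1000.0])))  # extreme value -> ~1.0
```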

Journal ArticleDOI
TL;DR: A memory-efficient incremental local outlier detection algorithm for data streams (MiLOF) and a more flexible version (MiLOF_F) are proposed; both have an accuracy close to incremental LOF but within a fixed memory bound.
Abstract: Outlier detection is an important task in data mining, with applications ranging from intrusion detection to human gait analysis. With the growing need to analyze high-speed data streams, the task of outlier detection becomes even more challenging, as traditional outlier detection techniques can no longer assume that all the data can be stored for processing. While the well-known Local Outlier Factor (LOF) algorithm has an incremental version, it assumes unbounded memory to keep all previous data points. In this paper, we propose a memory-efficient incremental local outlier (MiLOF) detection algorithm for data streams and a more flexible version (MiLOF_F); both have an accuracy close to incremental LOF but within a fixed memory bound. Our experimental results show that both proposed approaches have better memory and time complexity than incremental LOF while having comparable accuracy. In addition, we show that MiLOF_F is robust to changes in the number of data points, the number of underlying clusters and the number of dimensions in the data stream. These results show that MiLOF/MiLOF_F are well suited to application environments with limited memory (e.g., wireless sensor networks), and can be applied to high-volume data streams.
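MiLOF itself is not available in common libraries, but the batch Local Outlier Factor it approximates is; the snippet below shows the LOF scores that MiLOF aims to reproduce under a fixed memory budget. The n_neighbors value and the toy data are illustrative choices.

```python
# Batch LOF as a reference point for what MiLOF approximates on streams.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),
               rng.normal(6, 1, size=(300, 2)),
               [[3.0, 10.0]]])                    # a point off both clusters

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                       # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_            # larger = more outlying
print(labels[-1], np.argmax(scores))              # expected: -1 600
```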

Journal ArticleDOI
TL;DR: This paper designs a fast distributed feature extraction and data preparation framework to extract features from raw network-wide traffic and evaluates the approach in terms of detection rate, false positive rate, precision, recall and F-measure using several high-dimensional synthetic and real-world datasets.

Journal Article
TL;DR: In particular, this paper shows that, as far as Statistical Query algorithms are concerned, the computational complexity of learning Gaussian mixture models is inherently exponential in the dimension of the latent space, even though there is no such information-theoretic barrier.
Abstract: We describe a general technique that yields the first Statistical Query lower bounds for a range of fundamental high-dimensional learning problems involving Gaussian distributions. Our main results are for the problems of (1) learning Gaussian mixture models (GMMs), and (2) robust (agnostic) learning of a single unknown Gaussian distribution. For each of these problems, we show a super-polynomial gap between the (information-theoretic) sample complexity and the computational complexity of any Statistical Query algorithm for the problem. Statistical Query (SQ) algorithms are a class of algorithms that are only allowed to query expectations of functions of the distribution rather than directly access samples. This class of algorithms is quite broad: a wide range of known algorithmic techniques in machine learning are known to be implementable using SQs. Moreover, for the unsupervised learning problems studied in this paper, all known algorithms with non-trivial performance guarantees are SQ or are easily implementable using SQs. Our SQ lower bound for Problem (1) is qualitatively matched by known learning algorithms for GMMs. At a conceptual level, this result implies that, as far as SQ algorithms are concerned, the computational complexity of learning GMMs is inherently exponential in the dimension of the latent space, even though there is no such information-theoretic barrier. Our lower bound for Problem (2) implies that the accuracy of the robust learning algorithm in [DiakonikolasKKLMS16] is essentially best possible among all polynomial-time SQ algorithms. On the positive side, we also give a new (SQ) learning algorithm for Problem (2) achieving the information-theoretically optimal accuracy, up to a constant factor, whose running time essentially matches our lower bound. Our algorithm relies on a filtering technique generalizing [DiakonikolasKKLMS16] that removes outliers based on higher-order tensors. Our SQ lower bounds are attained via a unified moment-matching technique that is useful in other contexts and may be of broader interest. Our technique yields nearly-tight lower bounds for a number of related unsupervised estimation problems. Specifically, for the problems of (3) robust covariance estimation in spectral norm, and (4) robust sparse mean estimation, we establish a quadratic statistical-computational tradeoff for SQ algorithms, matching known upper bounds. Finally, our technique can be used to obtain tight sample complexity lower bounds for high-dimensional testing problems. Specifically, for the classical problem of robustly testing an unknown mean (known covariance) Gaussian, our technique implies an information-theoretic sample lower bound that scales linearly in the dimension. Our sample lower bound matches the sample complexity of the corresponding robust learning problem and separates the sample complexity of robust testing from standard (non-robust) testing. This separation is surprising because such a gap does not exist for the corresponding learning problem.

Journal ArticleDOI
TL;DR: Three methods for the identification of multivariate outliers are compared; all are based on the Mahalanobis distance, made resistant against outliers and model deviations by robust estimation of location and covariance.
Abstract: Three methods for the identification of multivariate outliers (Rousseeuw and Van Zomeren, 1990; Becker and Gather, 1999; Filzmoser et al., 2005) are compared. They are based on the Mahalanobis distance that will be made resistant against outliers and model deviations by robust estimation of location and covariance. The comparison is made by means of a simulation study. Not only the case of multivariate normally distributed data, but also heavy-tailed and asymmetric distributions are considered. The simulations are focused on low-dimensional (p = 5) and high-dimensional (p = 30) data.
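A standard construction behind such comparisons is the robust Mahalanobis distance computed from a Minimum Covariance Determinant (MCD) fit, with outliers flagged against a chi-square cutoff. A sketch using scikit-learn's MCD estimator; the 97.5% quantile cutoff is a common convention, not something mandated by these papers.

```python
# Robust Mahalanobis distances from an MCD fit, flagged by a chi2 cutoff.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
p = 5
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=500)
X[:10] += 8.0                                  # contaminate 2% of the rows

mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)                        # squared robust distances
outliers = d2 > chi2.ppf(0.975, df=p)
print(outliers[:10].all(), outliers[10:].mean())   # True, small FP rate
```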

Journal ArticleDOI
Ke Nian, Haofan Zhang, Aditya Tayal, Thomas F. Coleman, Yuying Li
TL;DR: It is illustrated that the spectral optimization in SRA can be viewed as a relaxation of an unsupervised SVM problem, and it is demonstrated that the first non-principal eigenvector of a Laplacian matrix is linked to a bi-class classification strength measure which can be used to rank anomalies.

Proceedings Article
Bo Xin, Yizhou Wang, Wen Gao, David Wipf, Baoyuan Wang
01 Apr 2016
TL;DR: In this paper, a deep network is used to learn iterative sparse estimation algorithms and is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.
Abstract: The iterations of many sparse estimation algorithms are comprised of a fixed linear filter cascaded with a thresholding nonlinearity, which collectively resemble a typical neural network layer. Consequently, a lengthy sequence of algorithm iterations can be viewed as a deep network with shared, hand-crafted layer weights. It is therefore quite natural to examine the degree to which a learned network model might act as a viable surrogate for traditional sparse estimation in domains where ample training data is available. While the possibility of a reduced computational budget is readily apparent when a ceiling is imposed on the number of layers, our work primarily focuses on estimation accuracy. In particular, it is well-known that when a signal dictionary has coherent columns, as quantified by a large RIP constant, then most tractable iterative algorithms are unable to find maximally sparse representations. In contrast, we demonstrate both theoretically and empirically the potential for a trained deep network to recover minimal ℓ0-norm representations in regimes where existing methods fail. The resulting system, which can effectively learn novel iterative sparse estimation algorithms, is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.
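The abstract's central observation in miniature: one ISTA iteration for sparse coding is a fixed linear map followed by a soft-threshold nonlinearity, exactly the shape of a network layer. Learned variants (e.g. LISTA) train the matrices W and S below instead of deriving them from the dictionary; the dictionary, sparsity level, and step counts here are illustrative.

```python
# ISTA for sparse coding: each iteration is one hand-crafted "layer".
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(D, y, lam=0.1, n_iter=500):
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of gradient
    W = D.T / L                                # fixed "input filter" per layer
    S = np.eye(D.shape[1]) - D.T @ D / L       # fixed recurrence matrix
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):                    # each pass = one "layer"
        x = soft_threshold(S @ x + W @ y, lam / L)
    return x

rng = np.random.default_rng(0)
D = rng.normal(size=(30, 60))
D /= np.linalg.norm(D, axis=0)                 # unit-norm dictionary columns
x_true = np.zeros(60)
x_true[[3, 17, 42]] = [1.0, -2.0, 1.5]
print(np.nonzero(np.round(ista(D, D @ x_true), 1))[0])   # ideally [3, 17, 42]
```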

Journal ArticleDOI
TL;DR: In this paper, a mixture of multivariate contaminated normal distributions is developed for model-based clustering, where each cluster has a parameter controlling the proportion of mild outliers and one specifying the degree of contamination.
Abstract: A mixture of multivariate contaminated normal distributions is developed for model-based clustering. In addition to the parameters of the classical normal mixture, our contaminated mixture has, for each cluster, a parameter controlling the proportion of mild outliers and one specifying the degree of contamination. Crucially, these parameters do not have to be specified a priori, adding a flexibility to our approach. Parsimony is introduced via eigen-decomposition of the component covariance matrices, and sufficient conditions for the identifiability of all the members of the resulting family are provided. An expectation-conditional maximization algorithm is outlined for parameter estimation and various implementation issues are discussed. Using a large-scale simulation study, the behavior of the proposed approach is investigated and comparison with well-established finite mixtures is provided. The performance of this novel family of models is also illustrated on artificial and real data.
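Concretely, the per-cluster density in such a contaminated mixture is a two-component normal mixture sharing one mean, with the second component's covariance inflated by a factor eta > 1; with probability 1 - alpha a point is "good", with probability alpha it is a mild outlier. A small evaluation sketch, with alpha and eta chosen arbitrarily for illustration.

```python
# Density of a single contaminated normal cluster:
# (1 - alpha) * N(mu, Sigma) + alpha * N(mu, eta * Sigma), eta > 1.
import numpy as np
from scipy.stats import multivariate_normal

def contaminated_normal_pdf(x, mu, Sigma, alpha=0.05, eta=9.0):
    good = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
    bad = multivariate_normal.pdf(x, mean=mu, cov=eta * Sigma)
    return (1.0 - alpha) * good + alpha * bad

mu, Sigma = np.zeros(2), np.eye(2)
for pt in ([0.0, 0.0], [4.0, 4.0]):
    # the inflated component keeps far points' density non-negligible
    print(pt, contaminated_normal_pdf(np.array(pt), mu, Sigma))
```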

Journal ArticleDOI
TL;DR: Extremal depth (ED) as discussed by the authors is a new notion for functional data, which is based on a measure of extreme "outlyingness" and is especially suited for obtaining central regions of functional data and function spaces.
Abstract: We propose a new notion called “extremal depth” (ED) for functional data, discuss its properties, and compare its performance with existing concepts. The proposed notion is based on a measure of extreme “outlyingness.” ED has several desirable properties that are not shared by other notions and is especially well suited for obtaining central regions of functional data and function spaces. In particular: (a) the central region achieves the nominal (desired) simultaneous coverage probability; (b) there is a correspondence between ED-based (simultaneous) central regions and appropriate pointwise central regions; and (c) the method is resistant to certain classes of functional outliers. The article examines the performance of ED and compares it with other depth notions. Its usefulness is demonstrated through applications to constructing central regions, functional boxplots, outlier detection, and simultaneous confidence bands in regression problems. Supplementary materials for this article are available online.

Journal ArticleDOI
TL;DR: New algorithms for continuous outlier monitoring in data streams, based on sliding windows, are proposed; they reduce the required storage overhead, are more efficient than previously proposed techniques, and offer significant flexibility with regard to the input parameters.

Journal ArticleDOI
TL;DR: An approach to state estimation for discrete-time linear time-invariant systems with measurements that may be affected by outliers is presented; it uses only a batch of the most recent inputs and outputs according to a moving-horizon strategy.

Journal ArticleDOI
TL;DR: An estimation method is developed based on the so-called “R^(α)-posterior density”; this construction uses the concept of priors in the Bayesian context and generates highly robust estimators with good efficiency under the true model.
Abstract: The ordinary Bayes estimator based on the posterior density can have potential problems with outliers. Using the density power divergence measure, we develop an estimation method in this paper based on the so-called “R^(α)-posterior density”; this construction uses the concept of priors in the Bayesian context and generates highly robust estimators with good efficiency under the true model. We develop the asymptotic properties of the proposed estimator and illustrate its performance numerically.
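As background, the density power divergence (DPD) of Basu et al. (1998), on which this line of work builds, replaces the log-likelihood with an objective that smoothly downweights outliers. A sketch for the robust mean of a normal with known scale; the closed-form integral below is specific to this toy case, and the tuning value a = 0.5 is arbitrary.

```python
# DPD objective for density f_theta and data x_1..x_n (tuning a > 0):
#   integral f_theta^(1+a) dx - (1 + 1/a) * mean(f_theta(x_i)^a);
# as a -> 0 this recovers maximum likelihood.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def dpd_loss(mu, x, a=0.5):
    # For N(mu, 1), the integral of f^(1+a) has this closed form.
    integral = (2 * np.pi) ** (-a / 2) / np.sqrt(1 + a)
    return integral - (1 + 1 / a) * np.mean(norm.pdf(x, mu, 1.0) ** a)

rng = np.random.default_rng(0)
x = np.append(rng.normal(0, 1, 200), [15.0] * 20)    # ~10% gross outliers
mle = x.mean()                                       # dragged toward 15
dpd = minimize_scalar(dpd_loss, args=(x,), bounds=(-5, 5), method='bounded').x
print(round(mle, 2), round(dpd, 2))                  # DPD stays near 0
```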

Journal ArticleDOI
TL;DR: Methods to detect outliers in network flow measurements that may be due to pipe bursts or unusual consumptions are fundamental to improve water distribution system on-line operation and management, and to ensure reliable historical data for sustainable planning and design of these systems.
Abstract: Methods to detect outliers in network flow measurements that may be due to pipe bursts or unusual consumptions are fundamental to improve water distribution system on-line operation and management, and to ensure reliable historical data for sustainable planning and design of these systems. To detect and classify anomalous events in flow data from district metering areas a four-step methodology was adopted, implemented and tested: i) data acquisition, ii) data validation and normalization, iii) anomalous observation detection, iv) anomalous event detection and characterization. This approach is based on the renewed concept of outlier regions and depends on a reduced number of configuration parameters: the number of past observations, the true positive rate and the false positive rate. Results indicate that this approach is flexible and applicable to the detection of different types of events (e.g., pipe burst, unusual consumption) and to different flow time series (e.g., instantaneous, minimum night flow).

Journal ArticleDOI
TL;DR: In this paper, the authors define a number of outlier detection algorithms related to the Huber-skip and least trimmed squares estimators, including the one-step Huberskip estimator and the forward search.
Abstract: Outlier detection algorithms are intimately connected with robust statistics that down-weight some observations to zero. We define a number of outlier detection algorithms related to the Huber-skip and least trimmed squares estimators, including the one-step Huber-skip estimator and the forward search. Next, we review a recently developed asymptotic theory of these. Finally, we analyse the gauge, the fraction of wrongly detected outliers, for a number of outlier detection algorithms and establish an asymptotic normal and a Poisson theory for the gauge.

Journal ArticleDOI
TL;DR: The method combining the Bonferroni–Holm test to judge each residual and the residual standardization strategy of PlabStat exhibited good ability to detect outliers in small and large datasets and under a genomic prediction application.
Abstract: We review and propose several methods for identifying possible outliers and evaluate their properties. The methods are applied to a genomic prediction program in hybrid rye. Many plant breeders use ANOVA-based software for routine analysis of field trials. These programs may offer specific in-built options for residual analysis that are lacking in current REML software. With the advance of molecular technologies, there is a need to switch to REML-based approaches, but without losing the good features of outlier detection methods that have proven useful in the past. Our aims were to compare the variance component estimates between ANOVA and REML approaches, to scrutinize the outlier detection method of the ANOVA-based package PlabStat and to propose and evaluate alternative procedures for outlier detection. We compared the outputs produced using ANOVA and REML approaches of four published datasets of generalized lattice designs. Five outlier detection methods are explained step by step. Their performance was evaluated by measuring the true positive rate and the false positive rate in a dataset with artificial outliers simulated in several scenarios. An implementation of genomic prediction using an empirical rye multi-environment trial was used to assess the outlier detection methods with respect to the predictive abilities of a mixed model for each method. We provide a detailed explanation of how the PlabStat outlier detection methodology can be translated to REML-based software together with the evaluation of alternative methods to identify outliers. The method combining the Bonferroni–Holm test to judge each residual and the residual standardization strategy of PlabStat exhibited good ability to detect outliers in small and large datasets and under a genomic prediction application. We recommend the use of outlier detection methods as a decision support in the routine data analyses of plant breeding experiments.
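A sketch of the flagged-residual idea named above: standardize residuals, compute two-sided normal p-values, and apply the Holm step-down correction, flagging any residual that survives. The median/MAD standardization used here is a robust stand-in, not PlabStat's exact strategy.

```python
# Bonferroni-Holm flagging of standardized residuals.
import numpy as np
from scipy.stats import norm

def holm_flags(resid, alpha=0.05):
    med = np.median(resid)
    z = (resid - med) / (1.4826 * np.median(np.abs(resid - med)))
    p = 2 * norm.sf(np.abs(z))                  # two-sided p-values
    order = np.argsort(p)
    flags = np.zeros(resid.size, dtype=bool)
    for rank, i in enumerate(order):            # Holm: compare to alpha/(m-rank)
        if p[i] <= alpha / (p.size - rank):
            flags[i] = True
        else:
            break                               # step-down stops at first failure
    return flags

r = np.random.default_rng(0).normal(size=100)
r[7] = 6.0                                      # one gross residual
print(np.where(holm_flags(r))[0])               # -> [7]
```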

Journal ArticleDOI
TL;DR: Theoretical results are illustrated by simulations, which show a significant increase in the accuracy of the OE model parameter estimates when the robust identification procedure is used instead of the linear identification algorithm for OE models.
Abstract: This paper considers a robust algorithm for identification of OE (output error) models with constrained output in the presence of non-Gaussian noises. In practical conditions, measurements contain rare observations that are inconsistent with the largest part of the population of observations (outliers). The synthesis of robust algorithms is based on Huber's theory of robust statistics. It is also a known fact that constraints play a very important role in many practical cases. If constraints are not taken into consideration, the control performance can degrade and the safety of a process may be at risk. The practical value of the proposed robust algorithm for estimation of OE model parameters with constrained output variance is further increased by using an optimal input design. It is shown that the optimal input can be obtained by a minimum variance controller whose reference is a white noise sequence with known variance. A key problem is that the optimal input depends on the system parameters to be identified. In order to be able to implement the proposed optimal input, an adaptive two-stage procedure for generating the input signal is proposed. Theoretical results are illustrated by simulations, which show a significant increase in the accuracy of the OE model parameter estimates when the robust identification procedure is used in place of the linear identification algorithm for OE models. It can also be seen that the convergence rate of the robust algorithm is further increased by using the optimal input design, which increases the practical value of the proposed robust procedure.
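This is not the paper's OE identification scheme, but the Huber robustification it builds on can be illustrated by M-estimation of a linear regression via iteratively reweighted least squares, where large residuals receive bounded weight. The tuning constant k = 1.345 is the usual choice for 95% efficiency under Gaussian noise; the model and data are illustrative.

```python
# Huber M-estimation by iteratively reweighted least squares (IRLS).
import numpy as np

def huber_irls(Phi, y, k=1.345, n_iter=50):
    theta = np.linalg.lstsq(Phi, y, rcond=None)[0]     # start from OLS
    for _ in range(n_iter):
        r = y - Phi @ theta
        s = 1.4826 * np.median(np.abs(r)) + 1e-12      # robust scale
        w = np.minimum(1.0, k / (np.abs(r / s) + 1e-12))  # Huber weights
        sw = np.sqrt(w)
        theta = np.linalg.lstsq(sw[:, None] * Phi, sw * y, rcond=None)[0]
    return theta

rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 2))
y = Phi @ np.array([1.0, -2.0]) + rng.normal(0, 0.1, 200)
y[::20] += 10.0                                        # impulsive outliers
print(huber_irls(Phi, y))                              # ~ [1, -2]
```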

Journal ArticleDOI
TL;DR: This paper proposes a Grid-Based Partition algorithm (GBP) as a data preparation method, and a Distributed LOF Computing method (DLC) for detecting density-based outliers in parallel that needs only a small amount of network communication.

Journal ArticleDOI
TL;DR: This paper proposes an online tracking algorithm based on a novel robust linear regression estimator that models the error term with the Gaussian-Laplacian distribution, which can be efficiently solved and provides insights on the relationships among the LSS problem, Huber loss function, and trivial templates.
Abstract: In this paper, we propose an online tracking algorithm based on a novel robust linear regression estimator. In contrast to existing methods, the proposed least soft-threshold squares (LSS) algorithm models the error term with the Gaussian-Laplacian distribution, which can be efficiently solved. For visual tracking, the Gaussian-Laplacian noise assumption enables our LSS model to handle normal appearance change and outliers simultaneously. Based on the maximum joint likelihood of parameters, we derive an LSS distance metric to measure the difference between an observation sample and a dictionary of positive templates. Compared with the distance derived from ordinary least squares methods, the proposed metric is more effective in dealing with outliers. In addition, we provide insights on the relationships among the LSS problem, the Huber loss function, and trivial templates, which facilitate better understanding of existing tracking methods. Finally, we develop a robust tracking algorithm based on the LSS distance metric with an update scheme and negative templates, and speed it up with a particle selection mechanism. Experimental results on numerous challenging image sequences demonstrate that the proposed tracking algorithm performs favorably against the state-of-the-art methods.

Proceedings ArticleDOI
01 May 2016
TL;DR: This paper proposes a collapsed Gibbs Sampling algorithm for the Dirichlet Process Multinomial Mixture model for text clustering (abbr. to GSDPMM) which does not need to specify the number of clusters in advance and can cope with the high-dimensional problem oftext clustering.
Abstract: Text clustering is a challenging problem due to the high-dimensional and large-volume characteristics of text datasets. In this paper, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Process Multinomial Mixture model for text clustering (abbr. to GSDPMM) which does not need to specify the number of clusters in advance and can cope with the high-dimensional problem of text clustering. Our extensive experimental study shows that GSDPMM can achieve significantly better performance than three other clustering methods and can achieve high consistency on both long and short text datasets. We found that GSDPMM has low time and space complexity and can scale well with huge text datasets. We also propose some novel and effective methods to detect the outliers in the dataset and obtain the representative words of each cluster.

Journal ArticleDOI
TL;DR: A novel technique called FEMI, which imputes numerical and categorical missing values by making an educated guess based on records that are similar to the record having a missing value, and applies a fuzzy clustering approach and the authors' novel fuzzy expectation maximization algorithm.
Abstract: Data preprocessing and cleansing play a vital role in data mining by ensuring good quality of data. Data-cleansing tasks include imputation of missing values, identification of outliers, and identification and correction of noisy data. In this paper, we present a novel technique called A Fuzzy Expectation Maximization and Fuzzy Clustering-based Missing Value Imputation Framework for Data Pre-processing (FEMI). It imputes numerical and categorical missing values by making an educated guess based on records that are similar to the record having a missing value. While identifying a group of similar records and making a guess based on the group, it applies a fuzzy clustering approach and our novel fuzzy expectation maximization algorithm. We evaluate FEMI on eight publicly available natural data sets by comparing its performance with the performance of five high-quality existing techniques, namely EMI, GkNN, FKMI, SVR and IBLLS. We use thirty-two types (patterns) of missing values for each data set. Two evaluation criteria, namely root mean squared error and mean absolute error, are used. Our experimental results indicate (according to a confidence interval and t-test analysis) that FEMI performs significantly better than EMI, GkNN, FKMI, SVR, and IBLLS.