Showing papers on "Outlier published in 2006"


Journal ArticleDOI
TL;DR: A new method is described that combines a new method of robust nonlinear regression with a new method of outlier identification, and identifies outliers from nonlinear curve fits with reasonable power and few false positives.
Abstract: Nonlinear regression, like linear regression, assumes that the scatter of data around the ideal curve follows a Gaussian or normal distribution. This assumption leads to the familiar goal of regression: to minimize the sum of the squares of the vertical or Y-value distances between the points and the curve. Outliers can dominate the sum-of-the-squares calculation, and lead to misleading results. However, we know of no practical method for routinely identifying outliers when fitting curves with nonlinear regression. We describe a new method for identifying outliers when fitting data with nonlinear regression. We first fit the data using a robust form of nonlinear regression, based on the assumption that scatter follows a Lorentzian distribution. We devised a new adaptive method that gradually becomes more robust as the method proceeds. To define outliers, we adapted the false discovery rate approach to handling multiple comparisons. We then remove the outliers, and analyze the data using ordinary least-squares regression. Because the method combines robust regression and outlier removal, we call it the ROUT method. When analyzing simulated data, where all scatter is Gaussian, our method detects (falsely) one or more outliers in only about 1–3% of experiments. When analyzing data contaminated with one or several outliers, the ROUT method performs well at outlier identification, with an average False Discovery Rate less than 1%. Our method, which combines a new method of robust nonlinear regression with a new method of outlier identification, identifies outliers from nonlinear curve fits with reasonable power and few false positives.
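The pipeline described above can be sketched compactly. The fragment below is a minimal, hypothetical rendering of the robust-fit-then-clean idea rather than the authors' exact procedure: SciPy's Cauchy loss stands in for the Lorentzian assumption, a fixed robust-scale cutoff (the constant k) replaces the paper's false discovery rate criterion, and the model function is an arbitrary example.

```python
import numpy as np
from scipy.optimize import least_squares

def model(params, x):
    # Illustrative nonlinear model: exponential decay with a plateau.
    a, b, c = params
    return a * np.exp(-b * x) + c

def rout_like_fit(x, y, p0, k=2.5):
    # 1. Robust fit: Cauchy loss approximates the Lorentzian scatter assumption.
    robust = least_squares(lambda p: model(p, x) - y, p0, loss="cauchy")
    resid = y - model(robust.x, x)
    # 2. Flag outliers; a k * (robust scale) cutoff is a simplification of the
    #    paper's false-discovery-rate criterion.
    scale = 1.4826 * np.median(np.abs(resid))
    keep = np.abs(resid) <= k * scale
    # 3. Refit the cleaned data by ordinary least squares.
    final = least_squares(lambda p: model(p, x[keep]) - y[keep], robust.x)
    return final.x, ~keep  # fitted parameters and the outlier mask
```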

981 citations


Journal ArticleDOI
TL;DR: For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude.
Abstract: Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what methods of robust estimation and outlier detection were constructed for, see e.g. Rousseeuw and Leroy (1987). Here we will focus on least trimmed squares (LTS) regression, which is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The computation time of existing LTS algorithms grows too much with the size of the data set, precluding their use for data mining. In this paper we develop a new algorithm called FAST-LTS. The basic ideas are an inequality involving order statistics and sums of squared residuals, and techniques which we call 'selective iteration' and 'nested extensions'. We also use an intercept adjustment technique to improve the precision. For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude. This allows us to apply FAST-LTS to large databases.
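For orientation, the sketch below shows the plain LTS "concentration" iteration that FAST-LTS builds on and accelerates: refit on the h cases with the smallest squared residuals, repeated from many random starts. The paper's selective iteration, nested extensions, and intercept adjustment are omitted, and all names are ours.

```python
import numpy as np

def lts_by_c_steps(X, y, h, n_starts=50, n_iter=20, seed=None):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        subset = rng.choice(n, size=p, replace=False)  # random elemental start
        beta = np.linalg.lstsq(X[subset], y[subset], rcond=None)[0]
        for _ in range(n_iter):
            r2 = (y - X @ beta) ** 2
            subset = np.argsort(r2)[:h]  # keep the h smallest squared residuals
            beta = np.linalg.lstsq(X[subset], y[subset], rcond=None)[0]
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_beta, best_obj = beta, obj
    return best_beta, best_obj
```

Each refit on the h best-fitting cases never increases the trimmed objective, which is why iterating such steps from many starts approximates the LTS fit.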

574 citations


Proceedings ArticleDOI
01 Sep 2006
TL;DR: A framework that computes in a distributed fashion an approximation of multi-dimensional data distributions in order to enable complex applications in resource-constrained sensor networks and demonstrates the applicability of the technique to other related problems in sensor networks.
Abstract: Sensor networks have recently found many popular applications in a number of different settings. Sensors at different locations can generate streaming data, which can be analyzed in real-time to identify events of interest. In this paper, we propose a framework that computes in a distributed fashion an approximation of multi-dimensional data distributions in order to enable complex applications in resource-constrained sensor networks. We motivate our technique in the context of the problem of outlier detection. We demonstrate how our framework can be extended in order to identify either distance- or density-based outliers in a single pass over the data, and with limited memory requirements. Experiments with synthetic and real data show that our method is efficient and accurate, and compares favorably to other proposed techniques. We also demonstrate the applicability of our technique to other related problems in sensor networks.

457 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: This paper presents a novel approach to outlier detection based on classification, which is superior to other methods based on the same reduction to classification but using standard classification methods, and is competitive with the state-of-the-art outlier detection methods in the literature.
Abstract: Most existing approaches to outlier detection are based on density estimation methods. There are two notable issues with these methods: one is the lack of explanation for outlier flagging decisions, and the other is the relatively high computational requirement. In this paper, we present a novel approach to outlier detection based on classification, in an attempt to address both of these issues. Our approach is based on two key ideas. First, we present a simple reduction of outlier detection to classification, via a procedure that involves applying classification to a labeled data set containing artificially generated examples that play the role of potential outliers. Once the task has been reduced to classification, we then apply a selective sampling mechanism based on active learning to the reduced classification problem. We empirically evaluate the proposed approach using a number of data sets, and find that our method is superior to other methods based on the same reduction to classification, but using standard classification methods. We also show that it is competitive with the state-of-the-art outlier detection methods in the literature based on density estimation, while significantly improving the computational complexity and explanatory power.
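The reduction itself is easy to make concrete. Below is a bare-bones, hypothetical version: real records are labeled 0, uniformly sampled "artificial outliers" are labeled 1, and a classifier's predicted probability of class 1 serves as the outlier score. The active-learning component of the paper is omitted, and the choice of a random forest classifier is ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classification_outlier_scores(X, n_artificial=None, seed=0):
    rng = np.random.default_rng(seed)
    n_artificial = n_artificial or len(X)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Artificial examples drawn uniformly over the data's bounding box.
    artificial = rng.uniform(lo, hi, size=(n_artificial, X.shape[1]))
    data = np.vstack([X, artificial])
    labels = np.r_[np.zeros(len(X)), np.ones(n_artificial)]
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(data, labels)
    # A high probability of the "artificial" class marks a likely outlier.
    return clf.predict_proba(X)[:, 1]
```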

357 citations


Book ChapterDOI
09 Apr 2006
TL;DR: A simple but effective measure of local outliers based on a symmetric neighborhood relationship is proposed, which considers both neighbors and reverse neighbors of an object when estimating its density distribution and is shown to be more effective in ranking outliers.
Abstract: Mining outliers in a database means finding exceptional objects that deviate from the rest of the data set. Besides classical outlier analysis algorithms, recent studies have focused on mining local outliers, i.e., outliers that have a density distribution significantly different from their neighborhood. The estimation of the density distribution at the location of an object has so far been based on the density distribution of its k-nearest neighbors [2,11]. However, when outliers are in a location where the density distributions in the neighborhood are significantly different, for example, in the case of objects from a sparse cluster close to a denser cluster, this may result in a wrong estimation. To avoid this problem, here we propose a simple but effective measure of local outliers based on a symmetric neighborhood relationship. The proposed measure considers both neighbors and reverse neighbors of an object when estimating its density distribution. As a result, outliers so discovered are more meaningful. To compute such local outliers efficiently, several mining algorithms are developed that detect top-n outliers based on our definition. A comprehensive performance evaluation and analysis shows that our methods are not only efficient in the computation but also more effective in ranking outliers.
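The symmetric-neighborhood idea can be illustrated directly. The sketch below (our naming, not the paper's) estimates each point's density from its k-nearest-neighbor distances and then compares it against the average density over the union of its neighbors and reverse neighbors; a brute-force O(n^2) reverse-neighbor search keeps the code short.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def symmetric_neighborhood_scores(X, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)              # column 0 is the point itself
    density = 1.0 / dist[:, 1:].mean(axis=1)  # inverse mean kNN distance
    n, scores = len(X), np.empty(len(X))
    for i in range(n):
        neighbors = set(idx[i, 1:])
        # Reverse neighbors: points that count i among their own k nearest.
        reverse = {j for j in range(n) if j != i and i in idx[j, 1:]}
        influence = list(neighbors | reverse)
        scores[i] = density[influence].mean() / density[i]
    return scores  # larger => more outlying relative to its neighborhood
```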

352 citations


Journal ArticleDOI
TL;DR: When data are contaminated with outliers, the use of the super-efficiency model to identify and remove outliers results in more accurate efficiency estimates than those obtained from the conventional DEA estimation model.

314 citations


Journal ArticleDOI
TL;DR: The New York Stock Exchange is chosen to provide evidence of problems affecting ultra-high-frequency data sets, and several methods of aggregating the data are suggested, from which time series of interest for econometric analysis can be constructed.

311 citations


Journal ArticleDOI
TL;DR: A distance-based outlier detection method is proposed that finds the top outliers in an unlabeled data set and provides a subset of it that can be used to predict the outlierness of new unseen objects.
Abstract: A distance-based outlier detection method that finds the top outliers in an unlabeled data set and provides a subset of it, called outlier detection solving set, that can be used to predict the outlierness of new unseen objects, is proposed. The solving set includes a sufficient number of points that permits the detection of the top outliers by considering only a subset of all the pairwise distances from the data set. The properties of the solving set are investigated, and algorithms for computing it, with subquadratic time requirements, are proposed. Experiments on synthetic and real data sets to evaluate the effectiveness of the approach are presented. A scaling analysis of the solving set size is performed, and the false positive rate, that is, the fraction of new objects misclassified as outliers using the solving set instead of the overall data set, is shown to be negligible. Finally, to investigate the accuracy in separating outliers from inliers, ROC analysis of the method is accomplished. Results obtained show that using the solving set instead of the data set guarantees a comparable quality of the prediction, but at a lower computational cost.

250 citations


Proceedings ArticleDOI
04 Jul 2006
TL;DR: This work develops an algorithm that is flexible with respect to the outlier definition, works in-network with a communication load proportional to the outcome, and reveals its outcome to all sensors.
Abstract: To address the problem of unsupervised outlier detection in wireless sensor networks, we develop an algorithm that (1) is flexible with respect to the outlier definition, (2) works in-network with a communication load proportional to the outcome, and (3) reveals its outcome to all sensors. We examine the algorithm’s performance using simulation with real sensor data streams. Our results demonstrate that the algorithm is accurate and imposes a reasonable communication load and level of power consumption.

247 citations


Proceedings ArticleDOI
11 Dec 2006
TL;DR: This paper applies an efficient data mining algorithm, the random forests algorithm, to anomaly-based NIDSs, and presents a modification of the random forests outlier detection algorithm that is comparable to previously reported unsupervised anomaly detection approaches evaluated over the KDD'99 dataset.
Abstract: Anomaly detection is a critical issue in Network Intrusion Detection Systems (NIDSs). Most anomaly based NIDSs employ supervised algorithms, whose performances highly depend on attack-free training data. However, this kind of training data is difficult to obtain in real world network environments. Moreover, with changing network environments or services, patterns of normal traffic will change. This leads to a high false positive rate of supervised NIDSs. Unsupervised outlier detection can overcome the drawbacks of supervised anomaly detection. Therefore, we apply an efficient data mining algorithm, the random forests algorithm, in anomaly based NIDSs. Without attack-free training data, the random forests algorithm can detect outliers in datasets of network traffic. In this paper, we discuss our framework of anomaly based network intrusion detection. In the framework, patterns of network services are built by the random forests algorithm over traffic data. Intrusions are detected by determining outliers related to the built patterns. We present the modification on the outlier detection algorithm of random forests. We also report our experimental results over the KDD'99 dataset. The results show that the proposed approach is comparable to previously reported unsupervised anomaly detection approaches evaluated over the KDD'99 dataset.

Journal ArticleDOI
TL;DR: A tunable algorithm for distributed outlier detection in dynamic mixed-attribute data sets that are prone to concept drift and models of the data must be dynamic as well is presented.
Abstract: Efficiently detecting outliers or anomalies is an important problem in many areas of science, medicine and information technology. Applications range from data cleaning to clinical diagnosis, from detecting anomalous defects in materials to fraud and intrusion detection. Over the past decade, researchers in data mining and statistics have addressed the problem of outlier detection using both parametric and non-parametric approaches in a centralized setting. However, there are still several challenges that must be addressed. First, most approaches to date have focused on detecting outliers in a continuous attribute space. However, almost all real-world data sets contain a mixture of categorical and continuous attributes. Categorical attributes are typically ignored or incorrectly modeled by existing approaches, resulting in a significant loss of information. Second, there have not been any general-purpose distributed outlier detection algorithms. Most distributed detection algorithms are designed with a specific domain (e.g. sensor networks) in mind. Third, the data sets being analyzed may be streaming or otherwise dynamic in nature. Such data sets are prone to concept drift, and models of the data must be dynamic as well. To address these challenges, we present a tunable algorithm for distributed outlier detection in dynamic mixed-attribute data sets.

09 Aug 2006
TL;DR: This paper reviews and compares several common and less common outlier labeling methods and presents information that shows how the percent of outliers changes in each method according to the skewness and sample size of lognormal distributions through simulations and application to real data sets.
Abstract: Most real-world data sets contain outliers that have unusually large or small values when compared with others in the data set. Outliers may cause a negative effect on data analyses, such as ANOVA and regression, based on distribution assumptions, or may provide useful information about data when we look into an unusual response to a given study. Thus, outlier detection is an important part of data analysis in the above two cases. Several outlier labeling methods have been developed. Some methods are sensitive to extreme values, like the SD method, and others are resistant to extreme values, like Tukey's method. Although these methods are quite powerful with large normal data, it may be problematic to apply them to non-normal data or small sample sizes without knowledge of their characteristics in these circumstances. This is because each labeling method has different measures to detect outliers, and expected outlier percentages change differently according to the sample size or distribution type of the data. Many kinds of data regarding public health are often skewed, usually to the right, and lognormal distributions can often be applied to such skewed data, for instance, surgical procedure times, blood pressure, and assessment of toxic compounds in environmental analysis. This paper reviews and compares several common and less common outlier labeling methods and presents information that shows how the percent of outliers changes in each method according to the skewness and sample size of lognormal distributions through simulations and application to real data sets. These results may help establish guidelines for the choice of outlier detection methods in skewed data, which are often seen in the public health field.
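The two families of labeling rules named above are simple enough to state in code. This sketch contrasts the SD method (sensitive to extreme values) with Tukey's fences (resistant); the cutoff constants are the conventional defaults, and running it on lognormal samples reproduces the kind of comparison the paper simulates.

```python
import numpy as np

def sd_method(x, k=2.0):
    m, s = np.mean(x), np.std(x, ddof=1)
    return (x < m - k * s) | (x > m + k * s)

def tukey_fences(x, g=1.5):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - g * iqr) | (x > q3 + g * iqr)

# On right-skewed data the two rules flag very different fractions of points.
x = np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=1000)
print(sd_method(x).mean(), tukey_fences(x).mean())
```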

Journal ArticleDOI
TL;DR: The M-quantile model as mentioned in this paper is based on modeling quantile-like parameters of the conditional distribution of the target variable given the covariates, which avoids the problems associated with specification of random effects, allowing inter-domain differences to be characterized by the variation of area-specific M-quantile coefficients.
Abstract: Small area estimation techniques are employed when sample data are insufficient for acceptably precise direct estimation in domains of interest. These techniques typically rely on regression models that use both covariates and random effects to explain variation between domains. However, such models also depend on strong distributional assumptions, require a formal specification of the random part of the model and do not easily allow for outlier robust inference. We describe a new approach to small area estimation that is based on modelling quantile-like parameters of the conditional distribution of the target variable given the covariates. This avoids the problems associated with specification of random effects, allowing inter-domain differences to be characterized by the variation of area-specific M-quantile coefficients. The proposed approach is easily made robust against outlying data values and can be adapted for estimation of a wide range of area specific parameters, including that of the quantiles of the distribution of the target variable in the different small areas. Results from two simulation studies comparing the performance of the M-quantile modelling approach with more traditional mixed model approaches are also provided.

Journal ArticleDOI
J. Takeuchi, Kenji Yamanishi
TL;DR: This paper presents a unifying framework for dealing with outlier detection and change point detection, in which a probabilistic model is incrementally learned using an online discounting learning algorithm; the framework is compared with conventional methods to demonstrate its validity through simulation and experimental applications to incident detection in network security.
Abstract: We are concerned with the issue of detecting outliers and change points from time series. In the area of data mining, there has been increased interest in these issues since outlier detection is related to fraud detection, rare event discovery, etc., while change-point detection is related to event/trend change detection, activity monitoring, etc. Although in most previous work outlier detection and change point detection have not been related explicitly, this paper presents a unifying framework for dealing with both of them. In this framework, a probabilistic model of time series is incrementally learned using an online discounting learning algorithm, which can track a drifting data source adaptively by forgetting out-of-date statistics gradually. A score for any given data is calculated in terms of its deviation from the learned model, with a higher score indicating a high possibility of being an outlier. By taking an average of the scores over a window of a fixed length and sliding the window, we may obtain a new time series consisting of moving-averaged scores. Change point detection is then reduced to the issue of detecting outliers in that time series. We compare the performance of our framework with those of conventional methods to demonstrate its validity through simulation and experimental applications to incident detection in network security.
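The two-stage construction is easy to mimic in schematic form. The sketch below uses an exponentially discounted Gaussian model as a stand-in for the paper's richer probabilistic models: each point is scored by its negative log-likelihood under the current model, the statistics are discounted so old data are gradually forgotten, and a moving average of the scores turns change-point detection into outlier detection on the score series. All names and the discounting constant are illustrative.

```python
import numpy as np

def discounted_scores(x, r=0.02):
    mu, var, scores = x[0], 1.0, []
    for v in x:
        # Score: Gaussian negative log-likelihood under the current model.
        scores.append(0.5 * ((v - mu) ** 2 / var + np.log(2 * np.pi * var)))
        # Online discounting update: forget out-of-date statistics gradually.
        mu = (1 - r) * mu + r * v
        var = (1 - r) * var + r * (v - mu) ** 2
    return np.array(scores)

def change_point_scores(x, r=0.02, window=20):
    s = discounted_scores(x, r)
    kernel = np.ones(window) / window
    return np.convolve(s, kernel, mode="same")  # moving-averaged score series
```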

Journal ArticleDOI
TL;DR: This paper evaluates the power and efficiency of a simple outlier approach and describes a genome-wide scan for positive selection using a dense catalog of 1.58 million SNPs that were genotyped in three human populations, finding several extended genomic regions that contained multiple contiguous candidate selection genes.
Abstract: Identifying regions of the human genome that have been targets of positive selection will provide important insights into recent human evolutionary history and may facilitate the search for complex disease genes. However, the confounding effects of population demographic history and selection on patterns of genetic variation complicate inferences of selection when a small number of loci are studied. To this end, identifying outlier loci from empirical genome-wide distributions of genetic variation is a promising strategy to detect targets of selection. Here, we evaluate the power and efficiency of a simple outlier approach and describe a genome-wide scan for positive selection using a dense catalog of 1.58 million SNPs that were genotyped in three human populations. In total, we analyzed 14,589 genes, 385 of which possess patterns of genetic variation consistent with the hypothesis of positive selection. Furthermore, several extended genomic regions were found, spanning >500 kb, that contained multiple contiguous candidate selection genes. More generally, these data provide important practical insights into the limits of outlier approaches in genome-wide scans for selection, provide strong candidate selection genes to study in greater detail, and may have important implications for disease related research.

Journal ArticleDOI
TL;DR: Four techniques intended for noise removal to enhance data analysis in the presence of high noise levels are explored, including a hyperclique-based data cleaner (HCleaner), which generally leads to better clustering performance and higher quality association patterns as the amount of noise being removed increases.
Abstract: Removing objects that are noisy is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the product of low-level data errors that result from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered as noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amounts of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of these methods are based on traditional outlier detection techniques: distance-based, clustering-based, and an approach based on the local outlier factor (LOF) of an object. The other technique, which is a new method that we are proposing, is a hyperclique-based data cleaner (HCleaner). These techniques are evaluated in terms of their impact on the subsequent data analysis, specifically, clustering and association analysis. Our experimental results show that all of these methods can provide better clustering performance and higher quality association patterns as the amount of noise being removed increases, although HCleaner generally leads to better clustering performance and higher quality associations than the other three methods for binary data.

Proceedings ArticleDOI
17 Jun 2006
TL;DR: A generative model based approach to the multi-view stereo problem is presented; two implementations of the E-step of the algorithm, corresponding to the Mean Field and Bethe approximations of the free energy, are described and compared.
Abstract: In this paper, we present a generative model based approach to solve the multi-view stereo problem. The input images are considered to be generated by either one of two processes: (i) an inlier process, which generates the pixels which are visible from the reference camera and which obey the constant brightness assumption, and (ii) an outlier process which generates all other pixels. Depth and visibility are jointly modelled as a hidden Markov Random Field, and the spatial correlations of both are explicitly accounted for. Inference is made tractable by an EM-algorithm, which alternates between estimation of visibility and depth, and optimisation of model parameters. We describe and compare two implementations of the E-step of the algorithm, which correspond to the Mean Field and Bethe approximations of the free energy. The approach is validated by experiments on challenging real-world scenes, of which two are contaminated by independently moving objects.

Journal ArticleDOI
TL;DR: In this paper, a robust projection-pursuit method for principal component analysis (PCA) is proposed for the analysis of chemical data, where the number of variables is typically large.
Abstract: Principal Component Analysis (PCA) is very sensitive in the presence of outliers. One of the most appealing robust methods for principal component analysis uses the Projection-Pursuit principle. Here, one projects the data on a lower-dimensional space such that a robust measure of variance of the projected data will be maximized. The Projection-Pursuit based method for principal component analysis has recently been introduced in the field of chemometrics, where the number of variables is typically large. In this paper, it is shown that the currently available algorithm for robust Projection-Pursuit PCA performs poorly in the presence of many variables. A new algorithm is proposed that is more suitable for the analysis of chemical data. Its performance is studied by means of simulation experiments and illustrated on some real datasets.
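The Projection-Pursuit principle itself fits in a few lines. The toy sketch below searches random candidate directions and keeps the one maximizing a robust spread measure (the MAD) of the projected data; practical algorithms, including the one proposed in the paper, search the direction space far more cleverly, so this shows only the principle.

```python
import numpy as np

def pp_first_component(X, n_candidates=2000, seed=0):
    rng = np.random.default_rng(seed)
    Xc = X - np.median(X, axis=0)  # robust centering
    # Random unit vectors as candidate projection directions.
    D = rng.normal(size=(n_candidates, X.shape[1]))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    proj = Xc @ D.T  # shape (n_samples, n_candidates)
    mad = np.median(np.abs(proj - np.median(proj, axis=0)), axis=0)
    return D[np.argmax(mad)]  # direction with maximal robust scale
```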

Journal ArticleDOI
TL;DR: In this paper, the results of a survey regarding how published researchers prefer to deal with outliers are presented, and a set of 183 test validity studies is examined to document the effects of different approaches to the detection and exclusion of outliers on effect size measures.
Abstract: Extreme data points, or outliers, can have a disproportionate influence on the conclusions drawn from a set of bivariate correlational data. This paper addresses two aspects of outlier detection. The results of a survey regarding how published researchers prefer to deal with outliers are presented, and a set of 183 test validity studies is examined to document the effects of different approaches to the detection and exclusion of outliers on effect size measures. The study indicates that: (a) there is disagreement among researchers as to the appropriateness of deleting data points from a study; (b) researchers report greater use of visual examination of data than of numeric diagnostic techniques for detecting outliers; and (c) while outlier removal influenced effect size measures in individual studies, outlying data points were not found to be a substantial source of variance in a large test validity data set.

Proceedings Article
01 Jan 2006
TL;DR: RBRP as discussed by the authors is a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets, which scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions.
Abstract: Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of magnitude.
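For context, the sketch below is the standard nested-loop baseline with pruning that fast distance-based detectors such as RBRP improve on, not RBRP itself: a candidate is abandoned as soon as its running k-nearest-neighbor distance falls below the weakest of the current top-n outlier scores (RBRP's recursive binning makes near neighbors likely to be examined first, triggering this pruning much sooner).

```python
import heapq
import numpy as np

def top_n_knn_outliers(X, k=5, n=10):
    top = []  # min-heap of (kNN-distance score, index)
    for i, p in enumerate(X):
        cutoff = top[0][0] if len(top) == n else -np.inf
        nn = []  # negated distances: min-heap acting as a max-heap
        pruned = False
        for j, q in enumerate(X):
            if i == j:
                continue
            d = np.linalg.norm(p - q)
            if len(nn) < k:
                heapq.heappush(nn, -d)
            elif d < -nn[0]:
                heapq.heapreplace(nn, -d)
            if len(nn) == k and -nn[0] <= cutoff:
                pruned = True  # cannot beat the weakest current outlier
                break
        if not pruned:
            score = -nn[0]  # distance to the k-th nearest neighbor
            if len(top) < n:
                heapq.heappush(top, (score, i))
            else:
                heapq.heappushpop(top, (score, i))
    return sorted(top, reverse=True)
```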

Proceedings ArticleDOI
07 Aug 2006
TL;DR: An outlier detection scheme based on Bayesian belief networks is proposed, which captures the conditional dependencies among the observations of the attributes to detect outliers in sensor streamed data.
Abstract: Data reliability is an important issue from the user's perspective, in the context of streamed data in wireless sensor networks (WSN). Reliability is affected by harsh environmental conditions, interference in the wireless medium and the use of low quality sensors. Due to these conditions, the data generated by the sensors may get corrupted, resulting in outliers and missing values. Deciding whether an observation is an outlier or not depends on the behavior of the neighbors' readings as well as the readings of the sensor itself. This can be done by capturing the spatio-temporal correlations that exist among the observations of the sensor nodes. By using naive Bayesian networks for classification, we can estimate whether an observation belongs to a class or not. If it falls beyond the range of the class, then it can be detected as an outlier. However, naive Bayesian networks do not consider the conditional dependencies among the observations of sensor attributes. So, we propose an outlier detection scheme based on Bayesian Belief Networks, which captures the conditional dependencies among the observations of the attributes to detect the outliers in the sensor streamed data. Applicability of this scheme as a plug-in to the Component Oriented Middleware for Sensor Networks (COMiS) from our earlier research work is also presented.

Proceedings Article
16 Jul 2006
TL;DR: It is shown how outlier detection can be encoded in the large margin training principle of support vector machines by expressing a convex relaxation of the joint training problem as a semidefinite program; this approach can yield superior results to the standard soft margin approach in the presence of outliers.
Abstract: One of the well known risks of large margin training methods, such as boosting and support vector machines (SVMs), is their sensitivity to outliers. These risks are normally mitigated by using a soft margin criterion, such as hinge loss, to reduce outlier sensitivity. In this paper, we present a more direct approach that explicitly incorporates outlier suppression in the training process. In particular, we show how outlier detection can be encoded in the large margin training principle of support vector machines. By expressing a convex relaxation of the joint training problem as a semidefinite program, one can use this approach to robustly train a support vector machine while suppressing outliers. We demonstrate that our approach can yield superior results to the standard soft margin approach in the presence of outliers.

Journal Article
TL;DR: In this article, the authors describe some of the more commonly used identification methods for evaluating outliers and reducing their impact on the analysis of development and manufacturing process data.
Abstract: Outliers may provide useful information about the development and manufacturing process. Analysts use various statistical methods to evaluate outliers and to reduce their impact on the analysis. This article describes some of the more commonly used identification methods.

Journal ArticleDOI
TL;DR: A novel method is proposed to compute the cluster radius threshold, and a powerful clustering-based method is presented for unsupervised intrusion detection (CBUID).

Proceedings ArticleDOI
07 Jun 2006
TL;DR: Experimental studies show that the outlier detection technique using control charts is better than the technique modeled from linear regression, because the number of outliers detected by the control chart is smaller than by linear regression.
Abstract: Existing studies in data mining mostly focus on finding patterns in large datasets and further using them for organizational decision making. However, finding such exceptions and outliers has not yet received as much attention in the data mining field as some other topics have, such as association rules, classification and clustering. Thus, this paper describes the performance of control chart, linear regression, and Manhattan distance techniques for outlier detection in data mining. Experimental studies show that the outlier detection technique using control charts is better than the technique modeled from linear regression, because the number of outliers detected by the control chart is smaller than by linear regression. Further, experimental studies show that the Manhattan distance technique outperformed the other techniques as the threshold values increased.
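The control-chart technique reduces to the classical 3-sigma rule, shown below as a minimal sketch; the cutoff of three standard deviations is the conventional control-limit choice.

```python
import numpy as np

def control_chart_outliers(x, sigmas=3.0):
    m, s = np.mean(x), np.std(x, ddof=1)
    lcl, ucl = m - sigmas * s, m + sigmas * s  # lower/upper control limits
    return (x < lcl) | (x > ucl)  # True marks an out-of-control (outlier) point
```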

Journal ArticleDOI
01 Dec 2006
TL;DR: This survey discusses practical applications of outlier mining, provides a taxonomy for categorizing related mining techniques, and provides a comprehensive review of these techniques with their advantages and disadvantages.
Abstract: Data that appear to have different characteristics than the rest of the population are called outliers. Identifying outliers from huge data repositories is a very complex task called outlier mining. Outlier mining has been likened to finding needles in a haystack. However, outlier mining has a number of practical applications in areas such as fraud detection, network intrusion detection, and identification of competitor and emerging business trends in e-commerce. This survey discusses practical applications of outlier mining, and provides a taxonomy for categorizing related mining techniques. A comprehensive review of these techniques with their advantages and disadvantages, along with some current research issues, is provided.

Journal ArticleDOI
TL;DR: A novel detection algorithm, called High-Dimension Outlying subspace Detection (HighDOD), detects the outlying subspaces of high-dimensional data efficiently and outperforms other searching alternatives such as the naive top-down, bottom-up and random search methods.
Abstract: In this paper, we identify a new task for studying the outlying degree (OD) of high-dimensional data, i.e. finding the subspaces (subsets of features) in which the given points are outliers, which are called their outlying subspaces. Since the state-of-the-art outlier detection techniques fail to handle this new problem, we propose a novel detection algorithm, called High-Dimension Outlying subspace Detection (HighDOD), to detect the outlying subspaces of high-dimensional data efficiently. The intuitive idea of HighDOD is that we measure the OD of a point using the sum of distances between this point and its k nearest neighbors. Two heuristic pruning strategies are proposed to realize fast pruning in the subspace search, and an efficient dynamic subspace search method with a sample-based learning process has been implemented. Experimental results show that HighDOD is efficient and outperforms other searching alternatives such as the naive top-down, bottom-up and random search methods, and that the existing outlier detection methods cannot fulfill this new task effectively.
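The OD measure underlying HighDOD is direct to state. The sketch below (function name ours) computes a point's outlying degree in a candidate subspace as the sum of distances to its k nearest neighbors within that subspace; the pruning strategies and dynamic subspace search that make HighDOD efficient are not shown.

```python
import numpy as np

def outlying_degree(X, i, subspace, k=5):
    Xs = X[:, subspace]                    # project onto the candidate subspace
    d = np.linalg.norm(Xs - Xs[i], axis=1)
    d[i] = np.inf                          # exclude the point itself
    return np.sort(d)[:k].sum()            # sum of k-nearest-neighbor distances
```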

Proceedings ArticleDOI
18 Dec 2006
TL;DR: This paper presents two methods for transforming outlier scores into probabilities: one assumes the posterior probabilities follow a logistic sigmoid function, while the other models the score distributions as a mixture of exponential and Gaussian probability functions and calculates the posterior probabilities via Bayes' rule.
Abstract: Current outlier detection schemes typically output a numeric score representing the degree to which a given observation is an outlier. We argue that converting the scores into well-calibrated probability estimates is more favorable for several reasons. First, the probability estimates allow us to select the appropriate threshold for declaring outliers using a Bayesian risk model. Second, the probability estimates obtained from individual models can be aggregated to build an ensemble outlier detection framework. In this paper, we present two methods for transforming outlier scores into probabilities. The first approach assumes that the posterior probabilities follow a logistic sigmoid function and learns the parameters of the function from the distribution of outlier scores. The second approach models the score distributions as a mixture of exponential and Gaussian probability functions and calculates the posterior probabilities via Bayes' rule. We evaluated the efficacy of both methods in the context of threshold selection and ensemble outlier detection. We also show that the calibration accuracy improves with the aid of some labeled examples.
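As a rough illustration of the first (sigmoid) route, the sketch below fits a logistic function P(outlier | score) = 1 / (1 + exp(-(a*score + b))) using a handful of labeled examples via scikit-learn; this is a supervised simplification, whereas the paper also fits the sigmoid and the exponential-Gaussian mixture directly from the score distribution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate_scores(scores, labels, new_scores):
    # Fit the sigmoid parameters (a, b) on labeled (score, outlier?) pairs.
    lr = LogisticRegression().fit(np.asarray(scores).reshape(-1, 1), labels)
    return lr.predict_proba(np.asarray(new_scores).reshape(-1, 1))[:, 1]
```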

Journal ArticleDOI
TL;DR: An overview of robust chemometrical/statistical methods which search for the model fitted by the majority of the data, and hence are far less affected by outliers, is presented.
Abstract: In analytical chemistry, experimental data often contain outliers of one type or another. The most often used chemometrical/statistical techniques are sensitive to such outliers, and the results may be adversely affected by them. This paper presents an overview of robust chemometrical/statistical methods which search for the model fitted by the majority of the data, and hence are far less affected by outliers. As an extra benefit, we can then detect the outliers by their large deviation from the robust fit. We discuss robust procedures for estimating location and scatter, and for performing multiple linear regression, PCA, PCR, PLS, and classification. We also describe recent results concerning the robustness of Support Vector Machines, which are kernel-based methods for fitting non-linear models. Finally, we present robust approaches for the analysis of multiway data.