Showing papers on "Outlier published in 2006"


Journal ArticleDOI
TL;DR: A new method is described that combines a new method of robust nonlinear regression with a new method of outlier identification, and identifies outliers from nonlinear curve fits with reasonable power and few false positives.
Abstract: Nonlinear regression, like linear regression, assumes that the scatter of data around the ideal curve follows a Gaussian or normal distribution. This assumption leads to the familiar goal of regression: to minimize the sum of the squares of the vertical or Y-value distances between the points and the curve. Outliers can dominate the sum-of-the-squares calculation, and lead to misleading results. However, we know of no practical method for routinely identifying outliers when fitting curves with nonlinear regression. We describe a new method for identifying outliers when fitting data with nonlinear regression. We first fit the data using a robust form of nonlinear regression, based on the assumption that scatter follows a Lorentzian distribution. We devised a new adaptive method that gradually becomes more robust as the method proceeds. To define outliers, we adapted the false discovery rate approach to handling multiple comparisons. We then remove the outliers, and analyze the data using ordinary least-squares regression. Because the method combines robust regression and outlier removal, we call it the ROUT method. When analyzing simulated data, where all scatter is Gaussian, our method detects (falsely) one or more outliers in only about 1–3% of experiments. When analyzing data contaminated with one or several outliers, the ROUT method performs well at outlier identification, with an average False Discovery Rate less than 1%. Our method, which combines a new method of robust nonlinear regression with a new method of outlier identification, identifies outliers from nonlinear curve fits with reasonable power and few false positives.
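The pipeline described above can be sketched compactly. The fragment below is a minimal, hypothetical rendering of the robust-fit-then-clean idea rather than the authors' exact procedure: SciPy's Cauchy loss stands in for the Lorentzian assumption, a fixed robust-scale cutoff (the constant k) replaces the paper's false discovery rate criterion, and the model function is an arbitrary example.

```python
import numpy as np
from scipy.optimize import least_squares

def model(params, x):
    # Illustrative nonlinear model: exponential decay with a plateau.
    a, b, c = params
    return a * np.exp(-b * x) + c

def rout_like_fit(x, y, p0, k=2.5):
    # 1. Robust fit: Cauchy loss approximates the Lorentzian scatter assumption.
    robust = least_squares(lambda p: model(p, x) - y, p0, loss="cauchy")
    resid = y - model(robust.x, x)
    # 2. Flag outliers; a k * (robust scale) cutoff is a simplification of the
    #    paper's false-discovery-rate criterion.
    scale = 1.4826 * np.median(np.abs(resid))
    keep = np.abs(resid) <= k * scale
    # 3. Refit the cleaned data by ordinary least squares.
    final = least_squares(lambda p: model(p, x[keep]) - y[keep], robust.x)
    return final.x, ~keep  # fitted parameters and the outlier mask
```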

981 citations


Journal ArticleDOI
TL;DR: For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude.
Abstract: Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what methods of robust estimation and outlier detection were constructed for, see e.g. Rousseeuw and Leroy (1987). Here we will focus on least trimmed squares (LTS) regression, which is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The computation time of existing LTS algorithms grows too much with the size of the data set, precluding their use for data mining. In this paper we develop a new algorithm called FAST-LTS. The basic ideas are an inequality involving order statistics and sums of squared residuals, and techniques which we call 'selective iteration' and 'nested extensions'. We also use an intercept adjustment technique to improve the precision. For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude. This allows us to apply FAST-LTS to large databases.
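For orientation, the sketch below shows the plain LTS "concentration" iteration that FAST-LTS builds on and accelerates: refit on the h cases with the smallest squared residuals, repeated from many random starts. The paper's selective iteration, nested extensions, and intercept adjustment are omitted, and all names are ours.

```python
import numpy as np

def lts_by_c_steps(X, y, h, n_starts=50, n_iter=20, seed=None):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        subset = rng.choice(n, size=p, replace=False)  # random elemental start
        beta = np.linalg.lstsq(X[subset], y[subset], rcond=None)[0]
        for _ in range(n_iter):
            r2 = (y - X @ beta) ** 2
            subset = np.argsort(r2)[:h]  # keep the h smallest squared residuals
            beta = np.linalg.lstsq(X[subset], y[subset], rcond=None)[0]
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_beta, best_obj = beta, obj
    return best_beta, best_obj
```

Each refit on the h best-fitting cases never increases the trimmed objective, which is why iterating such steps from many starts approximates the LTS fit.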

574 citations


Proceedings ArticleDOI
01 Sep 2006
TL;DR: A framework that computes in a distributed fashion an approximation of multi-dimensional data distributions in order to enable complex applications in resource-constrained sensor networks and demonstrates the applicability of the technique to other related problems in sensor networks.
Abstract: Sensor networks have recently found many popular applications in a number of different settings. Sensors at different locations can generate streaming data, which can be analyzed in real-time to identify events of interest. In this paper, we propose a framework that computes in a distributed fashion an approximation of multi-dimensional data distributions in order to enable complex applications in resource-constrained sensor networks. We motivate our technique in the context of the problem of outlier detection. We demonstrate how our framework can be extended in order to identify either distance- or density-based outliers in a single pass over the data, and with limited memory requirements. Experiments with synthetic and real data show that our method is efficient and accurate, and compares favorably to other proposed techniques. We also demonstrate the applicability of our technique to other related problems in sensor networks.

457 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: This paper presents a novel approach to outlier detection based on classification, which is superior to other methods based on the same reduction to classification but using standard classification methods, and is competitive with the state-of-the-art outlier detection methods in the literature.
Abstract: Most existing approaches to outlier detection are based on density estimation methods. There are two notable issues with these methods: one is the lack of explanation for outlier flagging decisions, and the other is the relatively high computational requirement. In this paper, we present a novel approach to outlier detection based on classification, in an attempt to address both of these issues. Our approach is based on two key ideas. First, we present a simple reduction of outlier detection to classification, via a procedure that involves applying classification to a labeled data set containing artificially generated examples that play the role of potential outliers. Once the task has been reduced to classification, we then apply a selective sampling mechanism based on active learning to the reduced classification problem. We empirically evaluate the proposed approach using a number of data sets, and find that our method is superior to other methods based on the same reduction to classification, but using standard classification methods. We also show that it is competitive with the state-of-the-art outlier detection methods in the literature based on density estimation, while significantly improving the computational complexity and explanatory power.
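The reduction itself is easy to make concrete. Below is a bare-bones, hypothetical version: real records are labeled 0, uniformly sampled "artificial outliers" are labeled 1, and a classifier's predicted probability of class 1 serves as the outlier score. The active-learning component of the paper is omitted, and the choice of a random forest classifier is ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classification_outlier_scores(X, n_artificial=None, seed=0):
    rng = np.random.default_rng(seed)
    n_artificial = n_artificial or len(X)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Artificial examples drawn uniformly over the data's bounding box.
    artificial = rng.uniform(lo, hi, size=(n_artificial, X.shape[1]))
    data = np.vstack([X, artificial])
    labels = np.r_[np.zeros(len(X)), np.ones(n_artificial)]
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(data, labels)
    # A high probability of the "artificial" class marks a likely outlier.
    return clf.predict_proba(X)[:, 1]
```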

357 citations


Book ChapterDOI
09 Apr 2006
TL;DR: A simple but effective measure of local outliers based on a symmetric neighborhood relationship is proposed, which considers both neighbors and reverse neighbors of an object when estimating its density distribution and is shown to be more effective in ranking outliers.
Abstract: Mining outliers in a database means finding exceptional objects that deviate from the rest of the data set. Besides classical outlier analysis algorithms, recent studies have focused on mining local outliers, i.e., outliers that have a density distribution significantly different from their neighborhood. The estimation of the density distribution at the location of an object has so far been based on the density distribution of its k-nearest neighbors [2,11]. However, when outliers are in a location where the density distributions in the neighborhood are significantly different, for example, in the case of objects from a sparse cluster close to a denser cluster, this may result in a wrong estimation. To avoid this problem, here we propose a simple but effective measure of local outliers based on a symmetric neighborhood relationship. The proposed measure considers both neighbors and reverse neighbors of an object when estimating its density distribution. As a result, outliers so discovered are more meaningful. To compute such local outliers efficiently, several mining algorithms are developed that detect top-n outliers based on our definition. A comprehensive performance evaluation and analysis shows that our methods are not only efficient in the computation but also more effective in ranking outliers.
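The symmetric-neighborhood idea can be illustrated directly. The sketch below (our naming, not the paper's) estimates each point's density from its k-nearest-neighbor distances and then compares it against the average density over the union of its neighbors and reverse neighbors; a brute-force O(n^2) reverse-neighbor search keeps the code short.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def symmetric_neighborhood_scores(X, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)              # column 0 is the point itself
    density = 1.0 / dist[:, 1:].mean(axis=1)  # inverse mean kNN distance
    n, scores = len(X), np.empty(len(X))
    for i in range(n):
        neighbors = set(idx[i, 1:])
        # Reverse neighbors: points that count i among their own k nearest.
        reverse = {j for j in range(n) if j != i and i in idx[j, 1:]}
        influence = list(neighbors | reverse)
        scores[i] = density[influence].mean() / density[i]
    return scores  # larger => more outlying relative to its neighborhood
```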

352 citations


Journal ArticleDOI
TL;DR: When data are contaminated with outliers, the use of the super-efficiency model to identify and remove outliers results in more accurate efficiency estimates than those obtained from the conventional DEA estimation model.

314 citations


Journal ArticleDOI
TL;DR: The New York Stock Exchange is chosen to provide evidence of problems affecting ultra-high-frequency data sets, and several methods of aggregating the data are suggested, from which time series of interest for econometric analysis can be constructed.

311 citations


Journal ArticleDOI
TL;DR: A distance-based outlier detection method is proposed that finds the top outliers in an unlabeled data set and provides a subset of it that can be used to predict the outlierness of new unseen objects.
Abstract: A distance-based outlier detection method that finds the top outliers in an unlabeled data set and provides a subset of it, called outlier detection solving set, that can be used to predict the outlierness of new unseen objects, is proposed. The solving set includes a sufficient number of points that permits the detection of the top outliers by considering only a subset of all the pairwise distances from the data set. The properties of the solving set are investigated, and algorithms for computing it, with subquadratic time requirements, are proposed. Experiments on synthetic and real data sets to evaluate the effectiveness of the approach are presented. A scaling analysis of the solving set size is performed, and the false positive rate, that is, the fraction of new objects misclassified as outliers using the solving set instead of the overall data set, is shown to be negligible. Finally, to investigate the accuracy in separating outliers from inliers, ROC analysis of the method is accomplished. Results obtained show that using the solving set instead of the data set guarantees a comparable quality of the prediction, but at a lower computational cost.

250 citations


Proceedings ArticleDOI
04 Jul 2006
TL;DR: This work develops an algorithm that is flexible with respect to the outlier definition, works in-network with a communication load proportional to the outcome, and reveals its outcome to all sensors.
Abstract: To address the problem of unsupervised outlier detection in wireless sensor networks, we develop an algorithm that (1) is flexible with respect to the outlier definition, (2) works in-network with a communication load proportional to the outcome, and (3) reveals its outcome to all sensors. We examine the algorithm’s performance using simulation with real sensor data streams. Our results demonstrate that the algorithm is accurate and imposes a reasonable communication load and level of power consumption.

247 citations


Proceedings ArticleDOI
11 Dec 2006
TL;DR: This paper applies an efficient data mining algorithm, the random forests algorithm, to anomaly-based NIDSs, and presents a modification of the random forests outlier detection algorithm that is comparable to previously reported unsupervised anomaly detection approaches evaluated over the KDD'99 dataset.
Abstract: Anomaly detection is a critical issue in Network Intrusion Detection Systems (NIDSs). Most anomaly based NIDSs employ supervised algorithms, whose performances highly depend on attack-free training data. However, this kind of training data is difficult to obtain in real world network environments. Moreover, with changing network environments or services, patterns of normal traffic will change. This leads to a high false positive rate of supervised NIDSs. Unsupervised outlier detection can overcome the drawbacks of supervised anomaly detection. Therefore, we apply an efficient data mining algorithm, the random forests algorithm, in anomaly based NIDSs. Without attack-free training data, the random forests algorithm can detect outliers in datasets of network traffic. In this paper, we discuss our framework of anomaly based network intrusion detection. In the framework, patterns of network services are built by the random forests algorithm over traffic data. Intrusions are detected by determining outliers related to the built patterns. We present the modification on the outlier detection algorithm of random forests. We also report our experimental results over the KDD'99 dataset. The results show that the proposed approach is comparable to previously reported unsupervised anomaly detection approaches evaluated over the KDD'99 dataset.

Journal ArticleDOI
TL;DR: A tunable algorithm for distributed outlier detection in dynamic mixed-attribute data sets that are prone to concept drift and models of the data must be dynamic as well is presented.
Abstract: Efficiently detecting outliers or anomalies is an important problem in many areas of science, medicine and information technology. Applications range from data cleaning to clinical diagnosis, from detecting anomalous defects in materials to fraud and intrusion detection. Over the past decade, researchers in data mining and statistics have addressed the problem of outlier detection using both parametric and non-parametric approaches in a centralized setting. However, there are still several challenges that must be addressed. First, most approaches to date have focused on detecting outliers in a continuous attribute space. However, almost all real-world data sets contain a mixture of categorical and continuous attributes. Categorical attributes are typically ignored or incorrectly modeled by existing approaches, resulting in a significant loss of information. Second, there have not been any general-purpose distributed outlier detection algorithms. Most distributed detection algorithms are designed with a specific domain (e.g. sensor networks) in mind. Third, the data sets being analyzed may be streaming or otherwise dynamic in nature. Such data sets are prone to concept drift, and models of the data must be dynamic as well. To address these challenges, we present a tunable algorithm for distributed outlier detection in dynamic mixed-attribute data sets.

09 Aug 2006
TL;DR: This paper reviews and compares several common and less common outlier labeling methods and presents information that shows how the percent of outliers changes in each method according to the skewness and sample size of lognormal distributions through simulations and application to real data sets.
Abstract: Most real-world data sets contain outliers that have unusually large or small values when compared with others in the data set. Outliers may cause a negative effect on data analyses, such as ANOVA and regression, based on distribution assumptions, or may provide useful information about data when we look into an unusual response to a given study. Thus, outlier detection is an important part of data analysis in the above two cases. Several outlier labeling methods have been developed. Some methods are sensitive to extreme values, like the SD method, and others are resistant to extreme values, like Tukey's method. Although these methods are quite powerful with large normal data, it may be problematic to apply them to non-normal data or small sample sizes without knowledge of their characteristics in these circumstances. This is because each labeling method has different measures to detect outliers, and expected outlier percentages change differently according to the sample size or distribution type of the data. Many kinds of data regarding public health are often skewed, usually to the right, and lognormal distributions can often be applied to such skewed data, for instance, surgical procedure times, blood pressure, and assessment of toxic compounds in environmental analysis. This paper reviews and compares several common and less common outlier labeling methods and presents information that shows how the percent of outliers changes in each method according to the skewness and sample size of lognormal distributions through simulations and application to real data sets. These results may help establish guidelines for the choice of outlier detection methods in skewed data, which are often seen in the public health field.
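The two families of labeling rules named above are simple enough to state in code. This sketch contrasts the SD method (sensitive to extreme values) with Tukey's fences (resistant); the cutoff constants are the conventional defaults, and running it on lognormal samples reproduces the kind of comparison the paper simulates.

```python
import numpy as np

def sd_method(x, k=2.0):
    m, s = np.mean(x), np.std(x, ddof=1)
    return (x < m - k * s) | (x > m + k * s)

def tukey_fences(x, g=1.5):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - g * iqr) | (x > q3 + g * iqr)

# On right-skewed data the two rules flag very different fractions of points.
x = np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=1000)
print(sd_method(x).mean(), tukey_fences(x).mean())
```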

Journal ArticleDOI
TL;DR: The M-quantile model as mentioned in this paper is based on modeling quantile-like parameters of the conditional distribution of the target variable given the covariates, which avoids the problems associated with specification of random effects, allowing inter-domain differences to be characterized by the variation of area-specific M-quantile coefficients.
Abstract: Small area estimation techniques are employed when sample data are insufficient for acceptably precise direct estimation in domains of interest. These techniques typically rely on regression models that use both covariates and random effects to explain variation between domains. However, such models also depend on strong distributional assumptions, require a formal specification of the random part of the model and do not easily allow for outlier robust inference. We describe a new approach to small area estimation that is based on modelling quantile-like parameters of the conditional distribution of the target variable given the covariates. This avoids the problems associated with specification of random effects, allowing inter-domain differences to be characterized by the variation of area-specific M-quantile coefficients. The proposed approach is easily made robust against outlying data values and can be adapted for estimation of a wide range of area specific parameters, including that of the quantiles of the distribution of the target variable in the different small areas. Results from two simulation studies comparing the performance of the M-quantile modelling approach with more traditional mixed model approaches are also provided.

Journal ArticleDOI
J. Takeuchi, Kenji Yamanishi
TL;DR: This paper presents a unifying framework for dealing with outlier detection and change point detection, in which a probabilistic model is incrementally learned using an online discounting learning algorithm; the framework is compared with conventional methods to demonstrate its validity through simulation and experimental applications to incident detection in network security.
Abstract: We are concerned with the issue of detecting outliers and change points from time series. In the area of data mining, there has been increased interest in these issues since outlier detection is related to fraud detection, rare event discovery, etc., while change-point detection is related to event/trend change detection, activity monitoring, etc. Although in most previous work outlier detection and change point detection have not been related explicitly, this paper presents a unifying framework for dealing with both of them. In this framework, a probabilistic model of time series is incrementally learned using an online discounting learning algorithm, which can track a drifting data source adaptively by forgetting out-of-date statistics gradually. A score for any given data is calculated in terms of its deviation from the learned model, with a higher score indicating a high possibility of being an outlier. By taking an average of the scores over a window of a fixed length and sliding the window, we may obtain a new time series consisting of moving-averaged scores. Change point detection is then reduced to the issue of detecting outliers in that time series. We compare the performance of our framework with those of conventional methods to demonstrate its validity through simulation and experimental applications to incident detection in network security.
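The two-stage construction is easy to mimic in schematic form. The sketch below uses an exponentially discounted Gaussian model as a stand-in for the paper's richer probabilistic models: each point is scored by its negative log-likelihood under the current model, the statistics are discounted so old data are gradually forgotten, and a moving average of the scores turns change-point detection into outlier detection on the score series. All names and the discounting constant are illustrative.

```python
import numpy as np

def discounted_scores(x, r=0.02):
    mu, var, scores = x[0], 1.0, []
    for v in x:
        # Score: Gaussian negative log-likelihood under the current model.
        scores.append(0.5 * ((v - mu) ** 2 / var + np.log(2 * np.pi * var)))
        # Online discounting update: forget out-of-date statistics gradually.
        mu = (1 - r) * mu + r * v
        var = (1 - r) * var + r * (v - mu) ** 2
    return np.array(scores)

def change_point_scores(x, r=0.02, window=20):
    s = discounted_scores(x, r)
    kernel = np.ones(window) / window
    return np.convolve(s, kernel, mode="same")  # moving-averaged score series
```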

Journal ArticleDOI
TL;DR: This paper evaluates the power and efficiency of a simple outlier approach and describes a genome-wide scan for positive selection using a dense catalog of 1.58 million SNPs that were genotyped in three human populations, finding several extended genomic regions that contained multiple contiguous candidate selection genes.
Abstract: Identifying regions of the human genome that have been targets of positive selection will provide important insights into recent human evolutionary history and may facilitate the search for complex disease genes. However, the confounding effects of population demographic history and selection on patterns of genetic variation complicate inferences of selection when a small number of loci are studied. To this end, identifying outlier loci from empirical genome-wide distributions of genetic variation is a promising strategy to detect targets of selection. Here, we evaluate the power and efficiency of a simple outlier approach and describe a genome-wide scan for positive selection using a dense catalog of 1.58 million SNPs that were genotyped in three human populations. In total, we analyzed 14,589 genes, 385 of which possess patterns of genetic variation consistent with the hypothesis of positive selection. Furthermore, several extended genomic regions were found, spanning >500 kb, that contained multiple contiguous candidate selection genes. More generally, these data provide important practical insights into the limits of outlier approaches in genome-wide scans for selection, provide strong candidate selection genes to study in greater detail, and may have important implications for disease related research.

Journal ArticleDOI
TL;DR: Four techniques intended for noise removal to enhance data analysis in the presence of high noise levels are explored, including a hyperclique-based data cleaner (HCleaner), which generally leads to better clustering performance and higher quality association patterns as the amount of noise being removed increases.
Abstract: Removing objects that are noisy is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the product of low-level data errors that result from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered as noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amounts of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of these methods are based on traditional outlier detection techniques: distance-based, clustering-based, and an approach based on the local outlier factor (LOF) of an object. The other technique, which is a new method that we are proposing, is a hyperclique-based data cleaner (HCleaner). These techniques are evaluated in terms of their impact on the subsequent data analysis, specifically, clustering and association analysis. Our experimental results show that all of these methods can provide better clustering performance and higher quality association patterns as the amount of noise being removed increases, although HCleaner generally leads to better clustering performance and higher quality associations than the other three methods for binary data.

Proceedings ArticleDOI
17 Jun 2006
TL;DR: A generative model based approach to the multi-view stereo problem is presented; two implementations of the E-step of the algorithm, corresponding to the Mean Field and Bethe approximations of the free energy, are described and compared.
Abstract: In this paper, we present a generative model based approach to solve the multi-view stereo problem. The input images are considered to be generated by either one of two processes: (i) an inlier process, which generates the pixels which are visible from the reference camera and which obey the constant brightness assumption, and (ii) an outlier process which generates all other pixels. Depth and visibility are jointly modelled as a hidden Markov Random Field, and the spatial correlations of both are explicitly accounted for. Inference is made tractable by an EM-algorithm, which alternates between estimation of visibility and depth, and optimisation of model parameters. We describe and compare two implementations of the E-step of the algorithm, which correspond to the Mean Field and Bethe approximations of the free energy. The approach is validated by experiments on challenging real-world scenes, of which two are contaminated by independently moving objects.

Journal ArticleDOI
TL;DR: In this paper, a robust projection-pursuit method for principal component analysis (PCA) is proposed for the analysis of chemical data, where the number of variables is typically large.
Abstract: Principal Component Analysis (PCA) is very sensitive in the presence of outliers. One of the most appealing robust methods for principal component analysis uses the Projection-Pursuit principle. Here, one projects the data on a lower-dimensional space such that a robust measure of variance of the projected data will be maximized. The Projection-Pursuit based method for principal component analysis has recently been introduced in the field of chemometrics, where the number of variables is typically large. In this paper, it is shown that the currently available algorithm for robust Projection-Pursuit PCA performs poorly in the presence of many variables. A new algorithm is proposed that is more suitable for the analysis of chemical data. Its performance is studied by means of simulation experiments and illustrated on some real datasets.
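The Projection-Pursuit principle itself fits in a few lines. The toy sketch below searches random candidate directions and keeps the one maximizing a robust spread measure (the MAD) of the projected data; practical algorithms, including the one proposed in the paper, search the direction space far more cleverly, so this shows only the principle.

```python
import numpy as np

def pp_first_component(X, n_candidates=2000, seed=0):
    rng = np.random.default_rng(seed)
    Xc = X - np.median(X, axis=0)  # robust centering
    # Random unit vectors as candidate projection directions.
    D = rng.normal(size=(n_candidates, X.shape[1]))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    proj = Xc @ D.T  # shape (n_samples, n_candidates)
    mad = np.median(np.abs(proj - np.median(proj, axis=0)), axis=0)
    return D[np.argmax(mad)]  # direction with maximal robust scale
```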

Journal ArticleDOI
TL;DR: In this paper, the results of a survey regarding how published researchers prefer to deal with outliers are presented, and a set of 183 test validity studies is examined to document the effects of different approaches to the detection and exclusion of outliers on effect size measures.
Abstract: Extreme data points, or outliers, can have a disproportionate influence on the conclusions drawn from a set of bivariate correlational data. This paper addresses two aspects of outlier detection. The results of a survey regarding how published researchers prefer to deal with outliers are presented, and a set of 183 test validity studies is examined to document the effects of different approaches to the detection and exclusion of outliers on effect size measures. The study indicates that: (a) there is disagreement among researchers as to the appropriateness of deleting data points from a study; (b) researchers report greater use of visual examination of data than of numeric diagnostic techniques for detecting outliers; and (c) while outlier removal influenced effect size measures in individual studies, outlying data points were not found to be a substantial source of variance in a large test validity data set.

Proceedings Article
01 Jan 2006
TL;DR: RBRP as discussed by the authors is a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets, which scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions.
Abstract: Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of magnitude.
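For context, the sketch below is the standard nested-loop baseline with pruning that fast distance-based detectors such as RBRP improve on, not RBRP itself: a candidate is abandoned as soon as its running k-nearest-neighbor distance falls below the weakest of the current top-n outlier scores (RBRP's recursive binning makes near neighbors likely to be examined first, triggering this pruning much sooner).

```python
import heapq
import numpy as np

def top_n_knn_outliers(X, k=5, n=10):
    top = []  # min-heap of (kNN-distance score, index)
    for i, p in enumerate(X):
        cutoff = top[0][0] if len(top) == n else -np.inf
        nn = []  # negated distances: min-heap acting as a max-heap
        pruned = False
        for j, q in enumerate(X):
            if i == j:
                continue
            d = np.linalg.norm(p - q)
            if len(nn) < k:
                heapq.heappush(nn, -d)
            elif d < -nn[0]:
                heapq.heapreplace(nn, -d)
            if len(nn) == k and -nn[0] <= cutoff:
                pruned = True  # cannot beat the weakest current outlier
                break
        if not pruned:
            score = -nn[0]  # distance to the k-th nearest neighbor
            if len(top) < n:
                heapq.heappush(top, (score, i))
            else:
                heapq.heappushpop(top, (score, i))
    return sorted(top, reverse=True)
```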

Proceedings ArticleDOI
07 Aug 2006
TL;DR: An outlier detection scheme based on Bayesian belief networks is proposed, which captures the conditional dependencies among the observations of the attributes to detect outliers in sensor streamed data.
Abstract: Data reliability is an important issue from the user's perspective, in the context of streamed data in wireless sensor networks (WSN). Reliability is affected by harsh environmental conditions, interference in the wireless medium and the use of low quality sensors. Due to these conditions, the data generated by the sensors may get corrupted, resulting in outliers and missing values. Deciding whether an observation is an outlier or not depends on the behavior of the neighbors' readings as well as the readings of the sensor itself. This can be done by capturing the spatio-temporal correlations that exist among the observations of the sensor nodes. By using naive Bayesian networks for classification, we can estimate whether an observation belongs to a class or not. If it falls beyond the range of the class, then it can be detected as an outlier. However, naive Bayesian networks do not consider the conditional dependencies among the observations of sensor attributes. So, we propose an outlier detection scheme based on Bayesian Belief Networks, which captures the conditional dependencies among the observations of the attributes to detect the outliers in the sensor streamed data. Applicability of this scheme as a plug-in to the Component Oriented Middleware for Sensor Networks (COMiS) from our earlier research work is also presented.

Proceedings Article
16 Jul 2006
TL;DR: It is shown how outlier detection can be encoded in the large margin training principle of support vector machines by expressing a convex relaxation of the joint training problem as a semidefinite program; this approach can yield superior results to the standard soft margin approach in the presence of outliers.
Abstract: One of the well known risks of large margin training methods, such as boosting and support vector machines (SVMs), is their sensitivity to outliers. These risks are normally mitigated by using a soft margin criterion, such as hinge loss, to reduce outlier sensitivity. In this paper, we present a more direct approach that explicitly incorporates outlier suppression in the training process. In particular, we show how outlier detection can be encoded in the large margin training principle of support vector machines. By expressing a convex relaxation of the joint training problem as a semidefinite program, one can use this approach to robustly train a support vector machine while suppressing outliers. We demonstrate that our approach can yield superior results to the standard soft margin approach in the presence of outliers.

Journal Article
TL;DR: In this article, the authors describe some of the more commonly used identification methods for evaluating outliers and reducing their impact on the analysis of development and manufacturing process data.
Abstract: Outliers may provide useful information about the development and manufacturing process. Analysts use various statistical methods to evaluate outliers and to reduce their impact on the analysis. This article describes some of the more commonly used identification methods.

Journal ArticleDOI
TL;DR: A novel method is proposed to compute the cluster radius threshold, and a powerful clustering-based method is presented for unsupervised intrusion detection (CBUID).

Proceedings ArticleDOI
07 Jun 2006
TL;DR: Experimental studies show that the outlier detection technique using control charts is better than the technique modeled from linear regression, because the number of outliers detected by the control chart is smaller than by linear regression.
Abstract: Existing studies in data mining mostly focus on finding patterns in large datasets and further using them for organizational decision making. However, finding such exceptions and outliers has not yet received as much attention in the data mining field as some other topics have, such as association rules, classification and clustering. Thus, this paper describes the performance of control chart, linear regression, and Manhattan distance techniques for outlier detection in data mining. Experimental studies show that the outlier detection technique using control charts is better than the technique modeled from linear regression, because the number of outliers detected by the control chart is smaller than by linear regression. Further, experimental studies show that the Manhattan distance technique outperformed the other techniques as the threshold values increased.
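The control-chart technique reduces to the classical 3-sigma rule, shown below as a minimal sketch; the cutoff of three standard deviations is the conventional control-limit choice.

```python
import numpy as np

def control_chart_outliers(x, sigmas=3.0):
    m, s = np.mean(x), np.std(x, ddof=1)
    lcl, ucl = m - sigmas * s, m + sigmas * s  # lower/upper control limits
    return (x < lcl) | (x > ucl)  # True marks an out-of-control (outlier) point
```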

Journal ArticleDOI
01 Dec 2006
TL;DR: This survey discusses practical applications of outlier mining, provides a taxonomy for categorizing related mining techniques, and provides a comprehensive review of these techniques with their advantages and disadvantages.
Abstract: Data that appear to have different characteristics than the rest of the population are called outliers. Identifying outliers from huge data repositories is a very complex task called outlier mining. Outlier mining has been likened to finding needles in a haystack. However, outlier mining has a number of practical applications in areas such as fraud detection, network intrusion detection, and identification of competitor and emerging business trends in e-commerce. This survey discusses practical applications of outlier mining, and provides a taxonomy for categorizing related mining techniques. A comprehensive review of these techniques with their advantages and disadvantages, along with some current research issues, is provided.

Journal ArticleDOI
TL;DR: A novel detection algorithm, called High-Dimension Outlying subspace Detection (HighDOD), detects the outlying subspaces of high-dimensional data efficiently and outperforms other searching alternatives such as the naive top-down, bottom-up and random search methods.
Abstract: In this paper, we identify a new task for studying the outlying degree (OD) of high-dimensional data, i.e. finding the subspaces (subsets of features) in which the given points are outliers, which are called their outlying subspaces. Since the state-of-the-art outlier detection techniques fail to handle this new problem, we propose a novel detection algorithm, called High-Dimension Outlying subspace Detection (HighDOD), to detect the outlying subspaces of high-dimensional data efficiently. The intuitive idea of HighDOD is that we measure the OD of a point using the sum of distances between this point and its k nearest neighbors. Two heuristic pruning strategies are proposed to realize fast pruning in the subspace search, and an efficient dynamic subspace search method with a sample-based learning process has been implemented. Experimental results show that HighDOD is efficient and outperforms other searching alternatives such as the naive top-down, bottom-up and random search methods, and that the existing outlier detection methods cannot fulfill this new task effectively.
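The OD measure underlying HighDOD is direct to state. The sketch below (function name ours) computes a point's outlying degree in a candidate subspace as the sum of distances to its k nearest neighbors within that subspace; the pruning strategies and dynamic subspace search that make HighDOD efficient are not shown.

```python
import numpy as np

def outlying_degree(X, i, subspace, k=5):
    Xs = X[:, subspace]                    # project onto the candidate subspace
    d = np.linalg.norm(Xs - Xs[i], axis=1)
    d[i] = np.inf                          # exclude the point itself
    return np.sort(d)[:k].sum()            # sum of k-nearest-neighbor distances
```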

Proceedings ArticleDOI
18 Dec 2006
TL;DR: This paper presents two methods for transforming outlier scores into probabilities: one assumes the posterior probabilities follow a logistic sigmoid function, while the other models the score distributions as a mixture of exponential and Gaussian probability functions and calculates the posterior probabilities via Bayes' rule.
Abstract: Current outlier detection schemes typically output a numeric score representing the degree to which a given observation is an outlier. We argue that converting the scores into well-calibrated probability estimates is more favorable for several reasons. First, the probability estimates allow us to select the appropriate threshold for declaring outliers using a Bayesian risk model. Second, the probability estimates obtained from individual models can be aggregated to build an ensemble outlier detection framework. In this paper, we present two methods for transforming outlier scores into probabilities. The first approach assumes that the posterior probabilities follow a logistic sigmoid function and learns the parameters of the function from the distribution of outlier scores. The second approach models the score distributions as a mixture of exponential and Gaussian probability functions and calculates the posterior probabilities via Bayes' rule. We evaluated the efficacy of both methods in the context of threshold selection and ensemble outlier detection. We also show that the calibration accuracy improves with the aid of some labeled examples.
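As a rough illustration of the first (sigmoid) route, the sketch below fits a logistic function P(outlier | score) = 1 / (1 + exp(-(a*score + b))) using a handful of labeled examples via scikit-learn; this is a supervised simplification, whereas the paper also fits the sigmoid and the exponential-Gaussian mixture directly from the score distribution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate_scores(scores, labels, new_scores):
    # Fit the sigmoid parameters (a, b) on labeled (score, outlier?) pairs.
    lr = LogisticRegression().fit(np.asarray(scores).reshape(-1, 1), labels)
    return lr.predict_proba(np.asarray(new_scores).reshape(-1, 1))[:, 1]
```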

Journal ArticleDOI
TL;DR: An overview of robust chemometrical/statistical methods which search for the model fitted by the majority of the data, and hence are far less affected by outliers, is presented.
Abstract: In analytical chemistry, experimental data often contain outliers of one type or another. The most often used chemometrical/statistical techniques are sensitive to such outliers, and the results may be adversely affected by them. This paper presents an overview of robust chemometrical/statistical methods which search for the model fitted by the majority of the data, and hence are far less affected by outliers. As an extra benefit, we can then detect the outliers by their large deviation from the robust fit. We discuss robust procedures for estimating location and scatter, and for performing multiple linear regression, PCA, PCR, PLS, and classification. We also describe recent results concerning the robustness of Support Vector Machines, which are kernel-based methods for fitting non-linear models. Finally, we present robust approaches for the analysis of multiway data.