Showing papers on "Outlier published in 2004"

PDF

Open Access

Journal Article•DOI•

A Survey of Outlier Detection Methodologies

[...]

Victoria J. Hodge¹, Jim Austin¹•Institutions (1)

01 Oct 2004-Artificial Intelligence Review

TL;DR: A survey of contemporary techniques for outlier detection is introduced and their respective motivations are identified and distinguish their advantages and disadvantages in a comparative review.

...read moreread less

Abstract: Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can identify errors and remove their contaminating effect on the data set and as such to purify the data for processing. The original outlier detection methods were arbitrary but now, principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.

...read moreread less

3,235 citations

Journal Article•DOI•

Is cross-validation valid for small-sample microarray classification?

[...]

Ulisses Braga-Neto, Edward R. Dougherty¹•Institutions (1)

University of Texas MD Anderson Cancer Center¹

12 Feb 2004-Bioinformatics

TL;DR: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules-linear discriminant analysis, 3-nearest-neighbor and decision trees (CART)-using both synthetic and real breast-cancer patient data.

...read moreread less

Abstract: Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples. Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules---linear discriminant analysis, 3-nearest-neighbor and decision trees (CART)---using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit, much less than with resubstitution). Availability and Supplementary information: A companion web site can be accessed at the URL http://ee.tamu.edu/~edward/cv_paper. The companion web site contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples.

...read moreread less

598 citations

Journal Article•DOI•

On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms

[...]

Kenji Yamanishi¹, Jun'ichi Takeuchi¹, Graham J. Williams², Peter A. Milne²•Institutions (2)

NEC¹, Commonwealth Scientific and Industrial Research Organisation²

01 May 2004-Data Mining and Knowledge Discovery

TL;DR: An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, with low computational costs.

...read moreread less

Abstract: Outlier detection is a fundamental issue in data mining, specifically in fraud detection, network intrusion detection, network monitoring, etc. SmartSifter is an outlier detection engine addressing this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SmartSifter and empirically demonstrates its effectiveness. SmartSifter detects outliers in an on-line process through the on-line unsupervised learning of a probabilistic model (using a finite mixture model) of the information source. Each time a datum is input SmartSifter employs an on-line discounting learning algorithm to learn the probabilistic model. A score is given to the datum based on the learned model with a high score indicating a high possibility of being a statistical outlier. The novel features of SmartSifter are: (1) it is adaptive to non-stationary sources of datas (2) a score has a clear statistical/information-theoretic meanings (3) it is computationally inexpensives and (4) it can handle both categorical and continuous variables. An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, with low computational costs. Further experimental application has identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.

...read moreread less

592 citations

Journal Article•DOI•

On-line outlier detection and data cleaning

[...]

Hancong Liu¹, Sirish L. Shah¹, Wei Jiang²•Institutions (2)

University of Alberta¹, Stevens Institute of Technology²

15 Aug 2004-Computers & Chemical Engineering

TL;DR: The proposed data filter-cleaner includes an on-li ne outlier-resistant estimate of the process model and combines it with a modified Kalman filter to detect and “clean” outliers.

...read moreread less

424 citations

Journal Article•DOI•

Parameter estimation in stochastic grey-box models

[...]

Niels Rode Kristensen¹, Henrik Madsen¹, Sten Bay Jørgensen¹•Institutions (1)

Technical University of Denmark¹

01 Feb 2004-Automatica

TL;DR: An efficient and flexible parameter estimation scheme for grey-box models in the sense of discretely, partially observed Ito stochastic differential equations with measurement noise is presented along with a corresponding software implementation that provides more accurate and more consistent estimates of the parameters of the diffusion term.

...read moreread less

351 citations

Journal Article•DOI•

The generalized LASSO

[...]

Volker Roth¹•Institutions (1)

University of Bonn¹

01 Jan 2004-IEEE Transactions on Neural Networks

TL;DR: This paper presents a different class of kernel regressors that effectively overcome the above problems, and presents a highly efficient algorithm with guaranteed global convergence that defies a unique framework for sparse regression models in the very rich class of IRLS models.

...read moreread less

Abstract: In the last few years, the support vector machine (SVM) method has motivated new interest in kernel regression techniques. Although the SVM has been shown to exhibit excellent generalization properties in many experiments, it suffers from several drawbacks, both of a theoretical and a technical nature: the absence of probabilistic outputs, the restriction to Mercer kernels, and the steep growth of the number of support vectors with increasing size of the training set. In this paper, we present a different class of kernel regressors that effectively overcome the above problems. We call this approach generalized LASSO regression. It has a clear probabilistic interpretation, can handle learning sets that are corrupted by outliers, produces extremely sparse solutions, and is capable of dealing with large-scale problems. For regression functionals which can be modeled as iteratively reweighted least-squares (IRLS) problems, we present a highly efficient algorithm with guaranteed global convergence. This defies a unique framework for sparse regression models in the very rich class of IRLS models, including various types of robust regression models and logistic regression. Performance studies for many standard benchmark datasets effectively demonstrate the advantages of this model over related approaches.

...read moreread less

281 citations

Journal Article•DOI•

A Robust Measure of Skewness

[...]

Guy Brys¹, Mia Hubert¹, Anja Struyf¹•Institutions (1)

University of Antwerp¹

01 Jan 2004-Journal of Computational and Graphical Statistics

TL;DR: The Medcouple as mentioned in this paper is a robust alternative to the classical skewness coefficient, which has a 25% breakdown value and a bounded influence function, and it has a fast algorithm for its computation, and investigate its finite sample behavior through simulated and real datasets.

...read moreread less

Abstract: The asymmetry of a univariate continuous distribution is commonly measured by the classical skewness coefficient. Because this estimator is based on the first three moments of the dataset, it is strongly affected by the presence of one or more outliers. This article investigates the medcouple, a robust alternative to the classical skewness coefficient. We show that it has a 25% breakdown value and a bounded influence function. We present a fast algorithm for its computation, and investigate its finite-sample behavior through simulated and real datasets.

...read moreread less

265 citations

Journal Article•DOI•

[...]

Miin-Shen Yang¹, Kuo-Lung Wu²•Institutions (2)

Chung Yuan Christian University¹, Kun Shan University²

01 Apr 2004-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: It is shown that the data points in SCM can self-organize local optimal cluster number and volumes without using cluster validity functions or a variance-covariance matrix and is robust to noise and outliers.

...read moreread less

Abstract: This paper presents an alternating optimization clustering procedure called a similarity-based clustering method (SCM). It is an effective and robust approach to clustering on the basis of a total similarity objective function related to the approximate density shape estimation. We show that the data points in SCM can self-organize local optimal cluster number and volumes without using cluster validity functions or a variance-covariance matrix. The proposed clustering method is also robust to noise and outliers based on the influence function and gross error sensitivity analysis. Therefore, SCM exhibits three robust clustering characteristics: 1) robust to the initialization (cluster number and initial guesses), 2) robust to cluster volumes (ability to detect different volumes of clusters), and 3) robust to noise and outliers. Several numerical data sets and actual data are used in the SCM to show these good aspects. The computational complexity of SCM is also analyzed. Some experimental results of comparing the proposed SCM with the existing methods show the superiority of the SCM method.

...read moreread less

211 citations

Journal Article•DOI•

A simple more general boxplot method for identifying outliers

[...]

Neil C. Schwertman¹, Margaret Ann Owens¹, Robiah Adnan²•Institutions (2)

California State University, Chico¹, Universiti Teknologi Malaysia²

01 Aug 2004-Computational Statistics & Data Analysis

TL;DR: A method for determining the probability associated with any fence or observation is proposed based on the cumulative distribution function of the order statistics, which allows the statistician to easily assess the degree to which an observation is dissimilar to the majority of the observations.

...read moreread less

201 citations

Journal Article•DOI•

Outlier Detection in the Multiple Cluster Setting Using the Minimum Covariance Determinant Estimator

[...]

Johanna Hardin¹, David M. Rocke²•Institutions (2)

Pomona College¹, University of California, Davis²

28 Jan 2004-Computational Statistics & Data Analysis

TL;DR: Malanobis-type distances in which the shape matrix is derived from a consistent high-breakdown robust multivariate location and scale estimator can be used to find outlying points in a robust clustering method in conjunction with an outlier identification method.

...read moreread less

193 citations

Journal Article•DOI•

The loss function of sensorimotor learning

[...]

Konrad P. Kording¹, Daniel M. Wolpert¹•Institutions (1)

University College London¹

29 Jun 2004-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: It is shown that people use a loss function in which the cost increases approximately quadratic with error for small errors and significantly less than quadratically for large errors, suggesting that models of sensorimotor control and learning that have assumed minimizing squared error are a good approximation but tend to penalize large errors excessively.

...read moreread less

Abstract: Motor learning can be defined as changing performance so as to optimize some function of the task, such as accuracy. The measure of accuracy that is optimized is called a loss function and specifies how the CNS rates the relative success or cost of a particular movement outcome. Models of pointing in sensorimotor control and learning usually assume a quadratic loss function in which the mean squared error is minimized. Here we develop a technique for measuring the loss associated with errors. Subjects were required to perform a task while we experimentally controlled the skewness of the distribution of errors they experienced. Based on the change in the subjects' average performance, we infer the loss function. We show that people use a loss function in which the cost increases approximately quadratically with error for small errors and significantly less than quadratically for large errors. The system is thus robust to outliers. This suggests that models of sensorimotor control and learning that have assumed minimizing squared error are a good approximation but tend to penalize large errors excessively.

...read moreread less

Journal Article•DOI•

Intensity-based image registration using robust correlation coefficients

[...]

Jeongtae Kim, Jeffrey A. Fessler¹•Institutions (1)

University of Michigan¹

01 Nov 2004-IEEE Transactions on Medical Imaging

TL;DR: An intensity-based image registration technique that uses a robust correlation coefficient as a similarity measure that reduces the influence of outliers and should be useful for image registration in radiotherapy and image-guided surgery applications.

...read moreread less

Abstract: The ordinary sample correlation coefficient is a popular similarity measure for aligning images from the same or similar modalities. However, this measure can be sensitive to the presence of "outlier" objects that appear in one image but not the other, such as surgical instruments, the patient table, etc., which can lead to biased registrations. This paper describes an intensity-based image registration technique that uses a robust correlation coefficient as a similarity measure. Relative to the ordinary sample correlation coefficient, the proposed similarity measure reduces the influence of outliers. We also compared the performance of the proposed method with the mutual information-based method. The robust correlation-based method should be useful for image registration in radiotherapy (KeV to MeV X-ray images) and image-guided surgery applications. We have investigated the properties of the proposed method by theoretical analysis, computer simulations, a phantom experiment, and with functional magnetic resonance imaging (MRI) data.

...read moreread less

Journal Article•DOI•

Robust adaptive-scale parametric model estimation for computer vision

[...]

Hanzi Wang¹, David Suter¹•Institutions (1)

Monash University, Clayton campus¹

01 Nov 2004-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: ASSC can simultaneously estimate the parameters of a model and the scale of the inliers belonging to that model and this work proposes two novel robust techniques: the two-step scale estimator (TSSE) and the adaptive scale sample consensus (ASSC) estimator.

...read moreread less

Abstract: Robust model fitting essentially requires the application of two estimators. The first is an estimator for the values of the model parameters. The second is an estimator for the scale of the noise in the (inlier) data. Indeed, we propose two novel robust techniques: the two-step scale estimator (TSSE) and the adaptive scale sample consensus (ASSC) estimator. TSSE applies nonparametric density estimation and density gradient estimation techniques, to robustly estimate the scale of the inliers. The ASSC estimator combines random sample consensus (RANSAC) and TSSE, using a modified objective function that depends upon both the number of inliers and the corresponding scale. ASSC is very robust to discontinuous signals and data with multiple structures, being able to tolerate more than 80 percent outliers. The main advantage of ASSC over RANSAC is that prior knowledge about the scale of inliers is not needed. ASSC can simultaneously estimate the parameters of a model and the scale of the inliers belonging to that model. Experiments on synthetic data show that ASSC has better robustness to heavily corrupted data than least median squares (LMedS), residual consensus (RESC), and adaptive least Kth order squares (ALKS). We also apply ASSC to two fundamental computer vision tasks: range image segmentation and robust fundamental matrix estimation. Experiments show very promising results.

...read moreread less

Proceedings Article•DOI•

On local spatial outliers

[...]

Pei Sun¹, Sanjay Chawla¹•Institutions (1)

University of Sydney¹

01 Nov 2004

TL;DR: A measure, spatial local outlier measure (SLOM), which captures the local behaviour of datum in their spatial neighborhood and takes into account the local stability around a data point and supresses the reporting of outliers in highly unstable areas.

...read moreread less

Abstract: We propose a measure, spatial local outlier measure (SLOM) which captures the local behaviour of datum in their spatial neighborhood. With the help of SLOM, we are able to discern local spatial outliers which are usually missed by global techniques like "three standard deviations away from the mean". Furthermore, the measure takes into account the local stability around a data point and supresses the reporting of outliers in highly unstable areas, where data is too heterogeneous and the notion of outliers is not meaningful. We prove several properties of SLOM and report experiments on synthetic and real data sets which show that our approach is scalable to large data sets.

...read moreread less

Book Chapter•DOI•

A Family of Geographically Weighted Regression Models

[...]

James P. LeSage¹•Institutions (1)

University of Toledo¹

01 Jan 2004-Research Papers in Economics

TL;DR: A Bayesian approach to locally linear regression methods introduced in McMillen (1996) and labeled geographically weighted regressions (GWR) in Brunsdon et al. as discussed by the authors is set forth in this chapter.

...read moreread less

Abstract: A Bayesian approach to locally linear regression methods introduced in McMillen (1996) and labeled geographically weighted regressions (GWR) in Brunsdon et al. (1996) is set forth in this chapter. The main contribution of the GWR methodology is use of distance weighted sub-samples of the data to produce locally linear regression estimates for every point in space. Each set of parameter estimates is based on a distance-weighted sub-sample of “neighboring observations,” which has a great deal of intuitive appeal in spatial econometrics. While this approach has a definite appeal, it also presents some problems. The Bayesian method introduced here can resolve some difficulties that arise in GWR models when the sample observations contain outliers or non-constant variance.

...read moreread less

Journal Article•DOI•

Principal component analysis applied to Fourier transform infrared spectroscopy for the design of calibration sets for glycerol prediction models in wine and for the detection and classification of outlier samples.

[...]

Helene Nieuwoudt¹, Bernard A. Prior², Isak S. Pretorius², Marena Manley², Florian Bauer² - Show less +1 more•Institutions (2)

Stellenbosch University¹, Australian Wine Research Institute²

16 Jun 2004-Journal of Agricultural and Food Chemistry

TL;DR: This study yielded an analytical strategy that combined the careful design of calibration sets with measures that facilitated the early detection and interpretation of poorly predicted samples and outlier samples in a sample set.

...read moreread less

Abstract: Principal component analysis (PCA) was used to identify the main sources of variation in the Fourier transform infrared (FT-IR) spectra of 329 wines of various styles. The FT-IR spectra were gathered using a specialized WineScan instrument. The main sources of variation included the reducing sugar and alcohol content of the samples, as well as the stage of fermentation and the maturation period of the wines. The implications of the variation between the different wine styles for the design of calibration models with accurate predictive abilities were investigated using glycerol calibration in wine as a model system. PCA enabled the identification and interpretation of samples that were poorly predicted by the calibration models, as well as the detection of individual samples in the sample set that had atypical spectra (i.e., outlier samples). The Soft Independent Modeling of Class Analogy (SIMCA) approach was used to establish a model for the classification of the outlier samples. A glycerol calibration for wine was developed (reducing sugar content 8% v/v) with satisfactory predictive ability (SEP = 0.40 g/L). The RPD value (ratio of the standard deviation of the data to the standard error of prediction) was 5.6, indicating that the calibration is suitable for quantification purposes. A calibration for glycerol in special late harvest and noble late harvest wines (RS 31-147 g/L, alcohol > 11.6% v/v) with a prediction error SECV = 0.65 g/L, was also established. This study yielded an analytical strategy that combined the careful design of calibration sets with measures that facilitated the early detection and interpretation of poorly predicted samples and outlier samples in a sample set. The strategy provided a powerful means of quality control, which is necessary for the generation of accurate prediction data and therefore for the successful implementation of FT-IR in the routine analytical laboratory.

...read moreread less

Journal Article•DOI•

Robust growing neural gas algorithm with application in cluster analysis

[...]

A. K. Qin¹, Ponnuthurai Nagaratnam Suganthan¹•Institutions (1)

Nanyang Technological University¹

01 Oct 2004-Neural Networks

TL;DR: Experimental results have shown the superior performance of the proposed RGNG method over the original GNG incorporating MDL method, called GNG-M, in static data clustering tasks on both artificial and UCI data sets.

...read moreread less

Journal Article•DOI•

Breakdown points for maximum likelihood estimators of location–scale mixtures

[...]

Christian Hennig

01 Aug 2004-Annals of Statistics

TL;DR: It turns out that the two alternatives, while adding stability in the presence of outliers of moderate size, do not possess a substantially better breakdown behavior than estimation based on Normal mixtures.

...read moreread less

Abstract: ML-estimation based on mixtures of Normal distributions is a widely used tool for cluster analysis. However, a single outlier can make the parameter estimation of at least one of the mixture components break down. Among others, the estimation of mixtures of t-distributions by McLachlan and Peel [Finite Mixture Models (2000) Wiley, New York] and the addition of a further mixture component accounting for ?noise? by Fraley and Raftery [The Computer J. 41 (1998) 578?588] were suggested as more robust alternatives. In this paper, the definition of an adequate robustness measure for cluster analysis is discussed and bounds for the breakdown points of the mentioned methods are given. It turns out that the two alternatives, while adding stability in the presence of outliers of moderate size, do not possess a substantially better breakdown behavior than estimation based on Normal mixtures. If the number of clusters s is treated as fixed, r additional points suffice for all three methods to let the parameters of r clusters explode. Only in the case of r=s is this not possible for t-mixtures. The ability to estimate the number of mixture components, for example, by use of the Bayesian information criterion of Schwarz [Ann. Statist. 6 (1978) 461?464], and to isolate gross outliers as clusters of one point, is crucial for an improved breakdown behavior of all three techniques. Furthermore, a mixture of Normals with an improper uniform distribution is proposed to achieve more robustness in the case of a fixed number of components.

...read moreread less

A multivariate outlier detection method

[...]

Peter Filzmoser¹•Institutions (1)

Vienna University of Technology¹

01 Jan 2004

TL;DR: A method for the detection of multivariate outliers is proposed which accounts for the data structure and sample size and defines the cut-off value by a measure of deviation of the empirical distribution function of the robust Mahalanobis distance from the theoretical distribution function.

...read moreread less

Abstract: A method for the detection of multivariate outliers is proposed which accounts for the data structure and sample size. The cut-off value for identifying outliers is defined by a measure of deviation of the empirical distribution function of the robust Mahalanobis distance from the theoretical distribution function. The method is easy to implement and fast to compute.

...read moreread less

Robust standard errors for robust estimators

[...]

Christophe Croux¹, Geert Dhaene, Dirk Hoorelbeke•Institutions (1)

Katholieke Universiteit Leuven¹

01 Jan 2004

TL;DR: This paper shows how robust standard errors can be computed for several robust estimators of regression, including MMestimators, and presents a test of the hypothesis that the robust and non-robust standard errors have the same probability limit.

...read moreread less

Abstract: A regression estimator is said to be robust if it is still reliable in the presence of outliers. On the other hand, its standard error is said to be robust if it is still reliable when the regression errors are autocorrelated and/or heteroskedastic. This paper shows how robust standard errors can be computed for several robust estimators of regression, including MMestimators. The improvement relative to non-robust standard errors is illustrated by means of large-sample bias calculations, simulations, and a real data example. It turns out that non-robust standard errors of robust estimators may be severely biased. However, if autocorrelation and heteroscedasticity are absent, non-robust standard errors are more efficient than the robust standard errors that we propose. We therefore also present a test of the hypothesis that the robust and non-robust standard errors have the same probability limit.

...read moreread less

Journal Article•DOI•

Robust principal component analysis and outlier detection with ecological data

[...]

Donald A. Jackson¹, Yong Chen²•Institutions (2)

University of Toronto¹, University of Maine²

01 Mar 2004-Environmetrics

TL;DR: This study provides a comparison of a standard method,based on the Mahalanobis distance, used in multivariate approaches to a robust method based on the minimum volume ellipsoid as a means of determining whether data sets contain outliers or not, and suggests that ecologists consider that their data may contain atypical points.

...read moreread less

Abstract: Ecological studies frequently involve large numbers of variables and observations, and these are often subject to various errors. If some data are not representative of the study population, they tend to bias the interpretation and conclusion of an ecological study. Because of the multivariate nature of ecological data, it is very difficult to identify atypical observations using approaches such as univariate or bivariate plots. This difficulty calls for the application of robust statistical methods in identifying atypical observations. Our study provides a comparison of a standard method, based on the Mahalanobis distance, used in multivariate approaches to a robust method based on the minimum volume ellipsoid as a means of determining whether data sets contain outliers or not. We evaluate both methods using simulations varying conditions of the data, and show that the minimum volume ellipsoid approach is superior in detecting outliers where present. We show that, as the sample size parameter, h, used in the robust approach increases in value, there is a decrease in the accuracy and precision of the associated estimate of the number of outliers present, in particular as the number of outliers increases. Conversely, where no outliers are present, large values for the parameter provide the most accurate results. In addition to the simulation results, we demonstrate the use of the robust principal component analysis with a data set of lake-water chemistry variables to illustrate the additional insight available. We suggest that ecologists consider that their data may contain atypical points. Following checks associated with normality, bivariate linearity and other traditional aspects, we advocate that ecologists examine their data sets using robust multivariate methods. Points identified as being atypical should be carefully evaluated based on background information to determine their suitability for inclusion in further multivariate analyses and whether additional factors explain their unusual characteristics. Copyright © 2004 John Wiley & Sons, Ltd.

...read moreread less

Journal Article•DOI•

Mining class outliers: concepts, algorithms and applications in CRM

[...]

Zengyou He¹, Xiaofei Xu¹, Joshua Zhexue Huang², Shengchun Deng¹•Institutions (2)

Harbin Institute of Technology¹, University of Hong Kong²

01 Nov 2004-Expert Systems With Applications

TL;DR: The notion of class outlier is developed and proposed practical solutions by extending existing outlier detection algorithms to this case are proposed and its potential applications in CRM (customer relationship management) are also discussed.

...read moreread less

Abstract: Outliers, or commonly referred to as exceptional cases, exist in many real-world databases. Detection of such outliers is important for many applications and has attracted much attention from the data mining research community recently. However, most existing methods are designed for mining outliers from a single dataset without considering the class labels of data objects. In this paper, we consider the class outlier detection problem ‘given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels’. By generalizing two pioneer contributions [Proc WAIM02 (2002); Proc SSTD03] in this field, we develop the notion of class outlier and propose practical solutions by extending existing outlier detection algorithms to this case. Furthermore, its potential applications in CRM (customer relationship management) are also discussed. Finally, the experiments in real datasets show that our method can find interesting outliers and is of practical use.

...read moreread less

Proceedings Article•DOI•

LOADED: link-based outlier and anomaly detection in evolving data sets

[...]

Amol Ghoting¹, Matthew Eric Otey¹, Srinivasan Parthasarathy¹•Institutions (1)

Ohio State University¹

01 Nov 2004

TL;DR: Experimental results show that LOADED provides very good detection and false positive rates, which are several times better than those of existing distance-based schemes.

...read moreread less

Abstract: In this paper, we present LOADED, an algorithm for outlier detection in evolving data sets containing both continuous and categorical attributes. LOADED is a tunable algorithm, wherein one can trade off computation for accuracy so that domain-specific response times are achieved. Experimental results show that LOADED provides very good detection and false positive rates, which are several times better than those of existing distance-based schemes.

...read moreread less

Book•

Environmental Statistics: Methods and Applications

[...]

Vic Barnett

16 Jan 2004

TL;DR: In this paper, the authors present an approach to estimate the mean and variance of a sample set using a simple random sample mean and the stratified sample mean, respectively, and compare the results with the results obtained using a normalized Pareto distribution.

...read moreread less

Abstract: Preface. Chapter 1: Introduction. 1.1 Tomorrow is too Late! 1.2 Environmental Statistics. 1.3 Some Examples. 1.3.1 'Getting it all together'. 1.3.2 'In time and space'. 1.3.3 'Keep it simple'. 1.3.4 'How much can we take?' 1.3.5 'Over the top'. 1.4 Fundamentals. 1.5 Bibliography. PART I: EXTREMAL STRESSES: EXTREMES, OUTLIERS, ROBUSTNESS. Chapter 2: Ordering and Extremes: Applications, models, inference. 2.1 Ordering the Sample. 2.1.1 Order statistics. 2.2 Order-based Inference. 2.3 Extremes and Extremal Processes. 2.3.1 Practical study and empirical models generalized extreme-value distributions. 2.4 Peaks over Thresholds and the Generalized Pareto Distribution. Chapter 3: Outliers and Robustness. 3.1 What is an Outlier? 3.2 Outlier Aims and Objectives. 3.3 Outlier-Generating Models. 3.3.1 Discordancy and models for outlier generation. 3.3.2 Tests of discordancy for specific distributions. 3.4 Multiple Outliers: Masking and Swamping. 3.5 Accommodation: Outlier-Robust Methods. 3.6 A Possible New Approach to Outliers. 3.7 Multivariate Outliers. 3.8 Detecting Multivariate Outliers. 3.8.1 Principles. 3.8.2 Informal methods. 3.9 Tests of Discordancy. 3.10 Accommodation. 3.11 Outliers in linear models. 3.12 Robustness in General. PART II: COLLECTING ENVIRONMENTAL DATA: SAMPLING AND MONITORING. Chapter 4: Finite-Population Sampling. 4.1 A Probabilistic Sampling Scheme. 4.2 Simple Random Sampling. 4.2.1 Estimating the mean, X. 4.2.2 Estimating the variance, S2. 4.2.3 Choice of sample size, n. 4.2.4 Estimating the population total, XT. 4.2.5 Estimating a proportion, P. 4.3 Ratios and Ratio Estimators. 4.3.1 The estimation of a ratio. 4.3.2 Ratio estimator of a population total or mean. 4.4 Stratified (simple) Random Sampling. 4.4.1 Comparing the simple random sample mean and the stratified sample mean. 4.4.2 Choice of sample sizes. 4.4.3 Comparison of proportional allocation and optimum allocation. 4.4.4 Optimum allocation for estimating proportions. 4.5 Developments of Survey Sampling. Chapter 5: Inaccessible and Sensitive Data. 5.1 Encountered Data. 5.2 Length-Biased or Size-Biased Sampling and Weighted Distributions. 5.2.1 Weighted distribution methods. 5.3 Composite Sampling. 5.3.1 Attribute Sampling. 5.3.2 Continuous variables. 5.3.3 Estimating mean and variance. 5.4 Ranked-Set Sampling. 5.4.1 The ranked-set sample mean. 5.4.2 Optimal estimation. 5.4.3 Ranked-set sampling for normal and exponential distributions. 5.4.4 Imperfect ordering. Chapter 6: Sampling in the Wild. 6.1 Quadrat Sampling. 6.2 Recapture Sampling. 6.2.1 The Petersen and Chapman estimators. 6.2.2 Capture-recapture methods in open populations. 6.3 Transect Sampling. 6.3.1 The simplest case: strip transects. 6.3.2 Using a detectability function. 6.3.3 Estimating f (y). 6.3.4 Modifications of approach. 6.3.5 Point transects or variable circular plots. 6.4 Adaptive Sampling. 6.4.1 Simple models for adaptive sampling. Part III: EXAMINING ENVIRONMENTAL EFFECTS: STIMULUS-RESPONSE RELATIONSHIPS. Chapter 7: Relationship: regression-type models and methods. 7.1 Linear Models. 7.1.1 The linear model. 7.1.2 The extended linear model. 7.1.3 The normal linear model. 7.2 Transformations. 7.2.1 Looking at the data. 7.2.2 Simple transformations. 7.2.3 General transformations. 7.3 The Generalized Linear Model. Chapter 8: Special Relationship Models, Including Quantal Response and Repeated Measures. 8.1 Toxicology Concerns. 8.2 Quantal Response. 8.3 Bioassay. 8.4 Repeated Measures. Part IV: STANDARDS AND REGULATIONS. Chapter 9: Environmental Standards. 9.1 Introduction. 9.2 The Statistically Verifiable Ideal Standard. 9.2.1 Other sampling methods. 9.3 Guard Point Standards. 9.4 Standards Along the Cause-Effect Chain. Part V: A MANY-DIMENSIONAL ENVIRONMENT: SPATIAL AND TEMPORAL PROCESSES. Chapter 10: Time-Series Methods. 10.1 Space and Time Effects. 10.2 Time Series. 10.3 Basic Issues. 10.4 Descriptive Methods. 10.4.1 Estimating or eliminating trend. 10.4.2 Periodicities. 10.4.3 Stationary time series. 10.5 Time-Domain Models and Methods. 10.6 Frequency-Domain Models and Methods. 10.6.1 Properties of the spectral representation. 10.6.2 Outliers in time series. 10.7 Point Processes. 10.7.1 The Poisson process. 10.7.2 Other point processes. Chapter 11: Spatial Methods for Environmental Processes. 11.1 Spatial Point Process Models and Methods. 11.2 The General Spatial Process. 11.2.1 Predication, interpolation and kriging. 11.2.2 Estimation of the variogram. 11.2.3 Other forms of kriging. 11.3 More about Standards Over Space and Time. 11.4 Relationship. 11.5 More about Spatial Models. 11.5.1 Types of spatial model. 11.5.2 Harmonic analysis of spatial processes. 11.6 Spatial Sampling and Spatial Design. 11.6.1 Spatial sampling. 11.6.2 Spatial design. 11.7 Spatial-Temporal Models and Methods. References. Index.

...read moreread less

Journal Article•DOI•

Variable threshold outlier identification in PIV data

[...]

A.-M. Shinneeb¹, J. D. Bugg¹, Ram Balachandar²•Institutions (2)

University of Saskatchewan¹, University of Windsor²

01 Sep 2004-Measurement Science and Technology

TL;DR: A variable threshold technique that can be applied to any particle image velocimetry (PIV) post-analysis outlier identification algorithm which uses a threshold such as the local median or the cellular neural network techniques which is found to be much less susceptible to erroneously rejecting good vectors.

...read moreread less

Abstract: This paper describes a variable threshold technique that can be applied to any particle image velocimetry (PIV) post-analysis outlier identification algorithm which uses a threshold such as the local median or the cellular neural network techniques. Although these techniques have been shown to work quite well with constant thresholds, the selection of the threshold is not always clear when working with real data. Moreover, if a small threshold is selected, a very large number of valid vectors can be mistakenly rejected. Although careful monitoring may alleviate this danger in many cases, that is not always practical when large data sets are being analysed and there is significant variability in the properties of the vector fields. The method described in this paper adjusts the threshold by calculating a mean variation between a candidate vector and its eight neighbours. The main benefit is that much smaller thresholds can be used without suffering catastrophic loss of valid vectors. The main challenge in obtaining this threshold field is that it must be based on a filtered field to be representative of the underlying velocity field. In this work, a simple median filter which requires no threshold was used for preliminary rejection. A local threshold was then calculated from the mean difference between each vector and its neighbours. The threshold field was also filtered with a Gaussian kernel before use. The algorithm was tested and compared to the base techniques by generating artificial velocity fields with known numbers of spurious vectors. For these tests, the ability of the algorithms to identify bad vectors and preserve good vectors was monitored. In addition, the technique was tested on real PIV data from the developing region of an axisymmetric jet. The variable threshold versions of these algorithms were found to be much less susceptible to erroneously rejecting good vectors. This is because the variable threshold techniques extract information about the local velocity gradient from the data themselves. The user-adjustable parameters for the variable threshold methods were found to be more universal than the constant threshold methods.

...read moreread less

Proceedings Article•DOI•

Privacy-preserving outlier detection

[...]

Jaideep Vaidya¹, Chris Clifton¹•Institutions (1)

Rutgers University¹

01 Nov 2004

TL;DR: This work looks at the problem of finding outliers in large distributed databases where privacy/security concerns restrict the sharing of data, and proposes techniques to detect outlier in such scenarios while giving formal guarantees on the amount of information disclosed.

...read moreread less

Abstract: Outlier detection can lead to the discovery of truly unexpected knowledge in many areas such as electronic commerce, credit card fraud and especially national security. We look at the problem of finding outliers in large distributed databases where privacy/security concerns restrict the sharing of data. Both homogeneous and heterogeneous distribution of data is considered. We propose techniques to detect outliers in such scenarios while giving formal guarantees on the amount of information disclosed.

...read moreread less

Journal Article•DOI•

Structural damage diagnosis by Kalman model based on stochastic subspace identification

[...]

Ai-Min Yan¹, Pascal De Boe¹, Jean-Claude Golinval¹•Institutions (1)

University of Liège¹

01 Jun 2004-Structural Health Monitoring-an International Journal

TL;DR: In this article, a Kalman model is constructed by performing a stochastic subspace identification to fit the measured response histories of the undamaged (reference) structure, which will not be able to reproduce the newly measured responses when damage occurs.

...read moreread less

Abstract: This paper presents an application of statistical process control techniques for damage diagnosis using vibration measurements. A Kalman model is constructed by performing a stochastic subspace identification to fit the measured response histories of the undamaged (reference) structure. It will not be able to reproduce the newly measured responses when damage occurs. The residual error of the prediction by the identified model with respect to the actual measurement of signals is defined as a damage-sensitive feature. The outlier statistics provides a quantitative indicator of damage. The advantage of the method is that model extraction is performed by using only the reference data and that no further modal identification is needed. On-line health monitoring of structures is therefore easily realized. When the structure consists of the assembly of several sub-structures, for which the dynamic interaction is weak, the damage may be located as the errors attain the maximum at the sensors instrumented in the ...

...read moreread less

Journal Article•DOI•

Genetic algorithms for outlier detection and variable selection in linear regression models

[...]

Jussi Tolvi¹•Institutions (1)

University of Turku¹

01 Aug 2004

TL;DR: A genetic algorithm is presented which considers different possible groupings of the data into outlier and non-outlier observations, and a genetic algorithm for a simultaneous outlier detection and variable selection is suggested.

...read moreread less

Abstract: This article addresses some problems in outlier detection and variable selection in linear regression models. First, in outlier detection there are problems known as smearing and masking. Smearing means that one outlier makes another, non-outlier observation appear as an outlier, and masking that one outlier prevents another one from being detected. Detecting outliers one by one may therefore give misleading results. In this article a genetic algorithm is presented which considers different possible groupings of the data into outlier and non-outlier observations. In this way all outliers are detected at the same time. Second, it is known that outlier detection and variable selection can influence each other, and that different results may be obtained, depending on the order in which these two tasks are performed. It may therefore be useful to consider these tasks simultaneously, and a genetic algorithm for a simultaneous outlier detection and variable selection is suggested. Two real data sets are used to illustrate the algorithms, which are shown to work well. In addition, the scalability of the algorithms is considered with an experiment using generated data.

...read moreread less

Journal Article•DOI•

On accounting classification and the international harmonisation debate

[...]

Christopher Nobes¹•Institutions (1)

University of Reading¹

01 Feb 2004-Accounting Organizations and Society

TL;DR: In an earlier issue of this journal, d'Arcy presented a classification of accounting systems that showed the UK and the US in separate groups and Australia as an outlier as discussed by the authors.

...read moreread less

Abstract: In an earlier issue of this journal, d'Arcy presents a classification of accounting systems that shows the UK and the US in separate groups and Australia as an outlier. It is suggested here that the classification is unsound because the data were unsuitable in nature and contained errors.

...read moreread less

Journal Article•DOI•

Smooth Transition Exponential Smoothing

[...]

James Taylor¹•Institutions (1)

University of Oxford¹

01 Sep 2004-Journal of Forecasting

TL;DR: In this paper, a logistic function of a user-specified variable is used to model the time-varying parameter in smooth transition models, which has the potential to outperform existing adaptive methods and constant parameter methods.

...read moreread less

Abstract: Adaptive exponential smoothing methods allow a smoothing parameter to change over time, in order to adapt to changes in the characteristics of the time series. However, these methods have tended to produce unstable forecasts and have performed poorly in empirical studies. This paper presents a new adaptive method, which enables a smoothing parameter to be modelled as a logistic function of a user-specified variable. The approach is analogous to that used to model the time-varying parameter in smooth transition models. Using simulated data, we show that the new approach has the potential to outperform existing adaptive methods and constant parameter methods when the estimation and evaluation samples both contain a level shift or both contain an outlier. An empirical study, using the monthly time series from the M3-Competition, gave encouraging results for the new approach.

...read moreread less

Collapse