
Showing papers on "Outlier published in 2001"


Proceedings ArticleDOI
Charu C. Aggarwal1, Philip S. Yu1
01 May 2001
TL;DR: New techniques for outlier detection which find the outliers by studying the behavior of projections from the data set are discussed.
Abstract: The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set.
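The projections the paper studies are low-dimensional grid cells that are far emptier than independence would predict. Below is a minimal sketch of that idea, assuming equi-depth ranges per dimension and a brute-force scan over k-dimensional projections (the paper itself uses an evolutionary search); the function names, grid resolution, and threshold are illustrative only.

```python
import numpy as np
from itertools import combinations

def sparsity_coefficient(count, N, f, k):
    """Standardized deviation of a k-dimensional cell count from its expectation
    under independence; strongly negative values flag abnormally sparse projections."""
    expected = N * f ** k
    return (count - expected) / np.sqrt(N * f ** k * (1 - f ** k))

def sparse_cells(X, phi=5, k=2, threshold=-3.0):
    """Scan all k-dimensional axis-parallel projections of X, grid each dimension
    into equi-depth ranges, and report occupied cells with low sparsity coefficients."""
    N, d = X.shape
    f = 1.0 / phi
    bins = np.empty_like(X, dtype=int)
    for j in range(d):
        # interior quantiles define equi-depth range boundaries for dimension j
        edges = np.quantile(X[:, j], np.linspace(0, 1, phi + 1)[1:-1])
        bins[:, j] = np.searchsorted(edges, X[:, j])
    flagged = []
    for dims in combinations(range(d), k):
        cols = list(dims)
        cells, counts = np.unique(bins[:, cols], axis=0, return_counts=True)
        for cell, c in zip(cells, counts):
            s = sparsity_coefficient(c, N, f, k)
            if s <= threshold:
                flagged.append((dims, tuple(cell), int(c), float(s)))
    return flagged
```

Points falling in the flagged cells are the candidate outliers; proximity to other points is never used, which is the point of the projection-based definition.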

1,132 citations


Proceedings ArticleDOI
07 Jul 2001
TL;DR: The theory of Robust Principal Component Analysis is developed and a robust M-estimation algorithm is described for learning linear multi-variate representations of high dimensional data such as images, which illustrates the benefits of RPCA when outliers are present.
Abstract: Principal Component Analysis (PCA) has been widely used for the representation of shape, appearance and motion. One drawback of typical PCA methods is that they are least squares estimation techniques and hence fail to account for "outliers" which are common in realistic training sets. In computer vision applications, outliers typically occur within a sample (image) due to pixels that are corrupted by noise, alignment errors, or occlusion. We review previous approaches for making PCA robust to outliers and present a new method that uses an intra-sample outlier process to account for pixel outliers. We develop the theory of Robust Principal Component Analysis (RPCA) and describe a robust M-estimation algorithm for learning linear multi-variate representations of high dimensional data such as images. Quantitative comparisons with traditional PCA and previous robust algorithms illustrate the benefits of RPCA when outliers are present. Details of the algorithm are described and a software implementation is being made publicly available.
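A toy illustration of the M-estimation idea, not the paper's algorithm: PCA is re-fit with per-pixel weights derived from a robust (Geman-McClure-style) function so that corrupted pixels stop dominating the least-squares fit. All names and constants here are illustrative.

```python
import numpy as np

def robust_pca(X, n_components=5, n_iter=20, c=2.0):
    """Toy iteratively-reweighted PCA: per-entry weights downweight pixel outliers.
    X has one sample (e.g. a flattened image) per row."""
    W = np.ones_like(X)                          # per-pixel weights, start uniform
    for _ in range(n_iter):
        mu = (W * X).sum(0) / W.sum(0)           # weighted mean image
        Xc = X - mu
        # weighted covariance and its leading eigenvectors (the robust basis)
        C = (W * Xc).T @ (W * Xc) / len(X)
        _, vecs = np.linalg.eigh(C)
        U = vecs[:, -n_components:]
        R = Xc - (Xc @ U) @ U.T                  # per-pixel reconstruction residuals
        sigma = 1.4826 * np.median(np.abs(R)) + 1e-12
        W = 1.0 / (1.0 + (R / (c * sigma)) ** 2) ** 2   # Geman-McClure-style weights
    return mu, U, W
```

Pixels with persistently large residuals end up with weights near zero, which is the effect the intra-sample outlier process is designed to achieve.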

378 citations


Proceedings ArticleDOI
26 Aug 2001
TL;DR: A novel method to efficiently find the top-n local outliers in large databases is proposed, using the concept of "micro-cluster" to compress the data.
Abstract: Outlier detection is an important task in data mining with numerous applications, including credit card fraud detection, video surveillance, etc. A recent work on outlier detection has introduced a novel notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned a Local Outlier Factor (LOF) which represents the likelihood of that object being an outlier. Although the concept of local outliers is a useful one, the computation of LOF values for every data object requires a large number of k-nearest neighbors searches and can be computationally expensive. Since most objects are usually not outliers, it is useful to provide users with the option of finding only the n most outstanding local outliers, i.e., the top-n data objects which are most likely to be local outliers according to their LOFs. However, if the pruning is not done carefully, finding top-n outliers could result in the same amount of computation as finding LOF for all objects. In this paper, we propose a novel method to efficiently find the top-n local outliers in large databases. The concept of "micro-cluster" is introduced to compress the data. An efficient micro-cluster-based local outlier mining algorithm is designed based on this concept. As our algorithm can be adversely affected by overlapping micro-clusters, we propose a meaningful cut-plane solution for overlapping data. The formal analysis and experiments show that this method can achieve good performance in finding the most outstanding local outliers.
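For orientation, a plain LOF computation followed by a top-n sort (scikit-learn's LocalOutlierFactor) shows the quantity the paper accelerates; the micro-cluster compression and pruning are the paper's contribution and are not reproduced here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),     # bulk of the data
               rng.uniform(-8, 8, size=(10, 2))])    # a few scattered points

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)                                    # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_                # larger = more outlying

top_n = 10
top_idx = np.argsort(scores)[-top_n:][::-1]           # indices of the top-n local outliers
print(top_idx, scores[top_idx])
```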

356 citations


Journal ArticleDOI
TL;DR: The authors proposed a hierarchical linear mixed-effects model for orthodontic data, in which the random effects and the within-subject errors have multivariate t-distributions with known or unknown degrees of freedom.
Abstract: Linear mixed-effects models are frequently used to analyze repeated measures data, because they model flexibly the within-subject correlation often present in this type of data. The most popular linear mixed-effects model for a continuous response assumes normal distributions for the random effects and the within-subject errors, making it sensitive to outliers. Such outliers are more problematic for mixed-effects models than for fixed-effects models, because they may occur in the random effects, in the within-subject errors, or in both, making them harder to detect in practice. Motivated by a real dataset from an orthodontic study, we propose a robust hierarchical linear mixed-effects model in which the random effects and the within-subject errors have multivariate t-distributions, with known or unknown degrees-of-freedom, which are allowed to vary with groups of subjects. By using a gamma-normal hierarchical structure, our model allows the identification and classification of both types of outliers,...
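The gamma-normal hierarchy can be illustrated directly: scaling a normal draw by a Gamma(ν/2, ν/2) precision weight yields a t-distributed error, and observations receiving small weights behave as outliers that the model automatically downweights. A minimal simulation sketch, not the paper's estimation procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
nu = 4                                     # degrees of freedom
n = 10_000

# gamma-normal hierarchy: tau_i ~ Gamma(nu/2, rate=nu/2), e_i | tau_i ~ N(0, 1/tau_i)
tau = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)
e = rng.normal(0, 1, size=n) / np.sqrt(tau)

# e is marginally t-distributed with nu degrees of freedom; observations with
# small tau (heavy downweighting) are the ones that act as outliers
print(np.var(e), nu / (nu - 2))            # sample variance vs. theoretical nu/(nu-2)
```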

301 citations


Journal ArticleDOI
01 Aug 2001-Genetics
TL;DR: It is found that outlier loci are easier to recognize when this joint distribution is conditioned on the total number of allelic states represented in the pooled sample at each locus, and the conditional distribution is less sensitive to the values of nuisance parameters.
Abstract: Population structure and history have similar effects on the genetic diversity at all neutral loci. However, some marker loci may also have been strongly influenced by natural selection. Selection shapes genetic diversity in a locus-specific manner. If we could identify those loci that have responded to selection during the divergence of populations, then we may obtain better estimates of the parameters of population history by excluding these loci. Previous attempts were made to identify outlier loci from the distribution of sample statistics under neutral models of population structure and history. Unfortunately these methods depend on assumptions about population structure and history that usually cannot be verified. In this article, we define new population-specific parameters of population divergence and construct sample statistics that are estimators of these parameters. We then use the joint distribution of these estimators to identify outlier loci that may be subject to selection. We found that outlier loci are easier to recognize when this joint distribution is conditioned on the total number of allelic states represented in the pooled sample at each locus. This is so because the conditional distribution is less sensitive to the values of nuisance parameters.

285 citations


Journal ArticleDOI
TL;DR: Combining traditional and robust statistical techniques provides a good method of identifying outliers in a reference interval setting, even in healthy samples, although the size of the effect varies considerably among analytes.
Abstract: Background: Improvement in reference interval estimation using a new outlier detection technique, even with a physician-determined healthy sample, is examined. The effect of including physician-determined nonhealthy individuals in the sample is evaluated. Methods: Traditional data transformation coupled with robust and exploratory outlier detection methodology were used in conjunction with various reference interval determination techniques. A simulation study was used to examine the effects of outliers on known reference intervals. Physician-defined healthy groups with and without nonhealthy individuals were compared on real data. Results: With 5% outliers in simulated samples, the described outlier detection techniques had narrower reference intervals. Application of the technique to real data provided reference intervals that were, on average, 10% narrower than those obtained when outlier detection was not used. Only 1.6% of the samples were identified as outliers and removed from reference interval determination in both the healthy and combined samples. Conclusions: Even in healthy samples, outliers may exist. Combining traditional and robust statistical techniques provides a good method of identifying outliers in a reference interval setting. Laboratories in general do not have a well-defined healthy group from which to compute reference intervals. The effect of nonhealthy individuals in the computation increases reference interval width by ∼10%. However, there is a large deviation among analytes.
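A rough sketch of the transform-then-trim idea: Box-Cox transform, Tukey-type fences on the transformed values, then the central 95% of the retained data. The paper's exact outlier rule and reference-interval estimators differ, and the constants below are illustrative.

```python
import numpy as np
from scipy import stats

def reference_interval(x, k=1.5):
    """Toy transform-then-trim reference interval: Box-Cox transform, flag points
    outside Tukey fences, then take the central 95% of the retained values."""
    x = np.asarray(x, dtype=float)
    z, _ = stats.boxcox(x)                        # requires strictly positive data
    q1, q3 = np.percentile(z, [25, 75])
    iqr = q3 - q1
    keep = (z >= q1 - k * iqr) & (z <= q3 + k * iqr)
    lo, hi = np.percentile(x[keep], [2.5, 97.5])
    return lo, hi, np.flatnonzero(~keep)           # interval plus indices of flagged outliers
```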

263 citations


Book ChapterDOI
01 Jan 2001
TL;DR: The Huber estimator, as discussed by the authors, is the best robust estimator in the sense that its maximal variance over all contaminated distributions is as small as possible; more generally, estimators with bounded influence functions are desirable.
Abstract: Results of a data analysis can only be convincing if they are based on sound methods. Robustness is a crucial attribute of the quality of a data analytic method. Among the various indicators of robustness, the sensitivity and influence functions are the most basic. They both assess the influence exerted by an additional observation, taking an arbitrary value, on the result of the data analysis. Large values of the influence function of an estimator indicate weaknesses, whereas estimators with bounded influence functions are desirable. The breakdown point of a statistical method is the minimal fraction of the observations which, when manipulated, can totally destroy the meaning of the results. This indicator is of a global nature, because it assesses the effects of contaminations of sizable fractions of the cases in a data set rather than a single case. Contaminated distributions such as (1−e)Φ(y)+eΦ(y/σ) are useful in modeling outliers. With probability (1−e) this mixture distribution yields an observation from the standard normal distribution Φ, whereas with probability e the observation comes from the contaminating distribution Φ(y/σ), which has an inflated variance σ² (σ > 1). The Huber estimator is the best robust estimator in the sense that its maximal variance over all contaminated distributions is as small as possible.
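A small sketch tying the two pieces together: a sample drawn from the contaminated distribution (1−e)Φ(y)+eΦ(y/σ) and a Huber M-estimate of location computed by iteratively reweighted means. The tuning constant and scale estimate are conventional choices, not taken from this chapter.

```python
import numpy as np

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location: observations farther than k*scale from the
    current estimate are downweighted, bounding the influence of outliers."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    s = 1.4826 * np.median(np.abs(x - mu))             # robust scale (MAD)
    for _ in range(max_iter):
        r = (x - mu) / s
        w = np.clip(k / np.maximum(np.abs(r), 1e-12), None, 1.0)   # Huber weights min(1, k/|r|)
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

# contaminated sample: (1 - e) N(0,1) + e N(0, sigma^2) with sigma = 5
rng = np.random.default_rng(0)
e, sigma = 0.1, 5.0
contam = rng.random(1000) < e
y = np.where(contam, rng.normal(0, sigma, 1000), rng.normal(0, 1, 1000))
print(np.mean(y), huber_location(y))   # the M-estimate is far less affected by the contamination
```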

249 citations


Book ChapterDOI
02 Jul 2001
TL;DR: This paper investigates if and how one-class classifiers can be combined best in a handwritten digit recognition problem and shows how this can increase the robustness of the classification.
Abstract: In the problem of one-class classification target objects should be distinguished from outlier objects. In this problem it is assumed that only information of the target class is available while nothing is known about the outlier class. Like standard two-class classifiers, one-class classifiers hardly ever fit the data distribution perfectly. Using only the best classifier and discarding the classifiers with poorer performance might waste valuable information. To improve performance the results of different classifiers (which may differ in complexity or training algorithm) can be combined. This can not only increase the performance but it can also increase the robustness of the classification. Because for one-class classifiers only information of one of the classes is present, combining one-class classifiers is more difficult. In this paper we investigate if and how one-class classifiers can be combined best in a handwritten digit recognition problem.
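A generic illustration of the combining step, assuming scikit-learn one-class models as stand-ins for the classifiers studied in the paper: each model is trained on target data only, and their standardized scores are averaged (the mean rule). The data, models, and normalization are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_target = rng.normal(0, 1, size=(300, 16))             # only target-class data is available
X_test = np.vstack([rng.normal(0, 1, size=(20, 16)),     # target-like test points
                    rng.uniform(-6, 6, size=(20, 16))])  # outliers

# a small pool of one-class models differing in complexity
models = [OneClassSVM(kernel="rbf", gamma=g, nu=0.1).fit(X_target)
          for g in (0.01, 0.1)] + [IsolationForest(random_state=0).fit(X_target)]

# combine by averaging standardized decision scores (mean rule)
scores = []
for m in models:
    s = m.decision_function(X_test)
    scores.append((s - s.mean()) / (s.std() + 1e-12))
combined = np.mean(scores, axis=0)      # higher = more target-like, lower = more outlier-like
print(combined[:5], combined[-5:])
```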

188 citations


Proceedings ArticleDOI
26 Aug 2001
TL;DR: This paper defines statistical tests, analyzes the statistical foundation underlying the approach, design several fast algorithms to detect spatial outliers, and provides a cost model for outlier detection procedures.
Abstract: Identification of outliers can lead to the discovery of unexpected, interesting, and useful knowledge. Existing methods are designed for detecting spatial outliers in multidimensional geometric data sets, where a distance metric is available. In this paper, we focus on detecting spatial outliers in graph structured data sets. We define statistical tests, analyze the statistical foundation underlying our approach, design several fast algorithms to detect spatial outliers, and provide a cost model for outlier detection procedures. In addition, we provide experimental results from the application of our algorithms on a Minneapolis-St. Paul (Twin Cities) traffic dataset to show their effectiveness and usefulness.
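One simple test in this spirit (not necessarily the paper's exact statistic) standardizes the difference between a node's attribute value and the average over its graph neighbors; nodes with large standardized differences are spatial outliers. A minimal sketch with a hypothetical sensor graph:

```python
import numpy as np

def spatial_outliers(values, neighbors, threshold=2.0):
    """Flag nodes whose attribute differs sharply from the average of their graph
    neighbors. `values` maps node -> attribute, `neighbors` maps node -> adjacent nodes."""
    diffs = {}
    for node, v in values.items():
        nbr_vals = [values[n] for n in neighbors[node]]
        diffs[node] = v - np.mean(nbr_vals)              # neighborhood difference
    d = np.array(list(diffs.values()))
    z = (d - d.mean()) / d.std()                          # standardize over all nodes
    return [node for node, zi in zip(diffs, z) if abs(zi) > threshold]

# toy traffic-like example: sensor 'c' reports a very different value from its neighbors
values = {"a": 50, "b": 52, "c": 5, "d": 49, "e": 51}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(spatial_outliers(values, neighbors))
```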

185 citations


Proceedings ArticleDOI
02 Apr 2001
TL;DR: It is demonstrated that a combination of outlier indexing with weighted sampling can be used to answer aggregation queries with a significantly reduced approximation error compared to either uniform sampling or weighted sampling alone.
Abstract: We study the problem of approximately answering aggregation queries using sampling. We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is skewed. To address this issue, we introduce a technique called outlier indexing. Uniform sampling is also ineffective for queries with low selectivity. We rely on weighted sampling based on workload information to overcome this shortcoming. We demonstrate that a combination of outlier indexing with weighted sampling can be used to answer aggregation queries with a significantly reduced approximation error compared to either uniform sampling or weighted sampling alone. We discuss the implementation of these techniques on Microsoft's SQL Server and present experimental results that demonstrate the merits of our techniques.
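A back-of-the-envelope sketch of the core idea for SUM: aggregate a small set of indexed outliers exactly and scale up a uniform sample of the rest. The cutoff rule and fractions are illustrative; the paper's system additionally uses workload-based weighted sampling inside SQL Server.

```python
import numpy as np

def approx_sum(values, sample_frac=0.01, outlier_frac=0.001, seed=0):
    """Estimate SUM(values) by combining an 'outlier index' (the most extreme values,
    aggregated exactly) with a uniform sample of the remaining rows."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    k = max(1, int(outlier_frac * len(values)))
    # index the k values farthest from the median as outliers
    order = np.argsort(np.abs(values - np.median(values)))
    outliers, rest = values[order[-k:]], values[order[:-k]]
    exact_part = outliers.sum()
    sample = rng.choice(rest, size=max(1, int(sample_frac * len(rest))), replace=False)
    estimated_part = sample.sum() * len(rest) / len(sample)   # scale sample up to the full table
    return exact_part + estimated_part

rng = np.random.default_rng(1)
data = np.concatenate([rng.exponential(1.0, 100_000), rng.exponential(500.0, 100)])  # skewed
print(data.sum(), approx_sum(data))
```

With the heavy tail handled exactly, the sampling error only has to cover the well-behaved bulk of the distribution.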

177 citations


Proceedings ArticleDOI
07 Jul 2001
TL;DR: An efficient algorithm to track point features supported by image patches undergoing affine deformations and changes in illumination is developed based on a combined model of geometry and photometry that is used to track features as well as to detect outliers in a hypothesis testing framework.
Abstract: We develop an efficient algorithm to track point features supported by image patches undergoing affine deformations and changes in illumination. The algorithm is based on a combined model of geometry and photometry that is used to track features as well as to detect outliers in a hypothesis testing framework. The algorithm runs in real time on a personal computer and is publicly available.

Journal ArticleDOI
TL;DR: A Monte Carlo study shows that this diagnostic procedure for detecting additive and innovation outliers as well as level shifts in a regression model with ARIMA errors is more powerful than the classical methods based on maximum likelihood type estimates and Kalman filtering.
Abstract: A diagnostic procedure for detecting additive and innovation outliers as well as level shifts in a regression model with ARIMA errors is introduced. The procedure is based on a robust estimate of the model parameters and on innovation residuals computed by means of robust filtering. A Monte Carlo study shows that, when there is a large proportion of outliers, this procedure is more powerful than the classical methods based on maximum likelihood type estimates and Kalman filtering. Copyright © 2001 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: The proposed PP method projects a high dimensional data set into a low dimensional data space while retaining desired information of interest, utilizing a projection index to explore projections of interestingness.
Abstract: The authors present a projection pursuit (PP) approach to target detection. Unlike most existing target detection algorithms, which require statistical models such as the linear mixture model, the proposed PP approach projects a high dimensional data set into a low dimensional data space while retaining the desired information of interest. It utilizes a projection index to explore projections of interestingness. For target detection applications in hyperspectral imagery, an interesting structure of an image scene is one caused by man-made targets in a large unknown background. Such targets can be viewed as anomalies in an image scene because their size is relatively small compared to their background surroundings. As a result, detecting small targets in an unknown image scene reduces to finding the outliers of background distributions. Skewness, defined as the normalized third moment of the sample distribution, measures the asymmetry of the distribution, while kurtosis, defined as the normalized fourth moment, measures its flatness. Both are susceptible to outliers, so a projection index based on skewness and kurtosis may be effective for target detection. In order to find an optimal projection index, an evolutionary algorithm is also developed to avoid being trapped in local optima. The hyperspectral image experiments show that the proposed PP method provides an effective means for target detection.
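A minimal sketch of a skewness/kurtosis-based projection index with a crude random search over directions (the paper uses an evolutionary algorithm for this optimization); all parameters are illustrative.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def pp_index(y):
    """Projection index favoring asymmetric / heavy-tailed (outlier-revealing) 1-D projections."""
    return skew(y) ** 2 + kurtosis(y) ** 2

def best_projection(X, n_candidates=2000, seed=0):
    """Crude random search over unit projection directions."""
    rng = np.random.default_rng(seed)
    best_w, best_val = None, -np.inf
    for _ in range(n_candidates):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)                  # random unit direction
        val = pp_index(X @ w)
        if val > best_val:
            best_w, best_val = w, val
    return best_w, best_val
```

The returned direction is the one along which the projected data look most non-Gaussian, which is where small anomalous targets tend to reveal themselves.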

Journal ArticleDOI
TL;DR: The proposed technique requires the computation of a constant matrix which encodes the point correspondence information, followed by an efficient iterative algorithm to compute the optimal rotations; the remaining unknowns are then recovered directly through the solution of a linear equation system.

Journal ArticleDOI
TL;DR: In this article, the three-part redescending estimator of Hampel is compared with the Huber estimator and the Fair function for data reconciliation and parameter estimation.

Proceedings ArticleDOI
Kenji Yamanishi1, Jun'ichi Takeuchi1
26 Aug 2001
TL;DR: Applying this framework to network intrusion detection, it is demonstrated that it can significantly improve the accuracy of SmartSifter and that outlier filtering rules can help the user discover a general pattern of an outlier group.
Abstract: This paper is concerned with the problem of detecting outliers from unlabeled data. In prior work we have developed SmartSifter, which is an on-line outlier detection algorithm based on unsupervised learning from data. On the basis of SmartSifter this paper yields a new framework for outlier filtering using both supervised and unsupervised learning techniques iteratively in order to make the detection process more effective and more understandable. The outline of the framework is as follows: In the first round, for an initial dataset, we run SmartSifter to give each data point a score, with a high score indicating a high possibility of being an outlier. Next, giving positive labels to a number of higher scored data points and negative labels to a number of lower scored data points, we create labeled examples. Then we construct an outlier filtering rule by supervised learning from them. Here the rule is generated based on the principle of minimizing extended stochastic complexity. In the second round, for a new dataset, we filter the data using the constructed rule, then among the filtered data, we run SmartSifter again to evaluate the data in order to update the filtering rule. Applying our framework to network intrusion detection, we demonstrate that (1) it can significantly improve the accuracy of SmartSifter, and (2) outlier filtering rules can help the user to discover a general pattern of an outlier group.
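A schematic of the two-round loop, with stand-ins clearly substituted: a Gaussian-mixture score replaces SmartSifter and a decision tree replaces the extended-stochastic-complexity rule learner. Only the structure of the framework is illustrated.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeClassifier

# Round 1: unsupervised scoring, then label the extremes and learn a filtering rule from them.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (1000, 4)), rng.normal(6, 1, (15, 4))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
scores = -gm.score_samples(X)                   # higher = more outlying (negative log-likelihood)
order = np.argsort(scores)
pos, neg = order[-30:], order[:100]             # high-scored -> positive labels, low-scored -> negative
rule = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    np.vstack([X[pos], X[neg]]),
    np.r_[np.ones(len(pos)), np.zeros(len(neg))])

# Round 2: filter a new dataset with the rule; the retained points would be re-scored
# and used to update the rule.
X_new = np.vstack([rng.normal(0, 1, (500, 4)), rng.normal(6, 1, (10, 4))])
filtered = X_new[rule.predict(X_new) == 1]
print(len(filtered), "points passed the filter for re-scoring")
```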

Journal ArticleDOI
TL;DR: In this article, the authors considered the problem of testing for a scale change in the infinite-order moving average process X_j = Σ_{i=0}^∞ a_i e_{j−i}, where the e_j are i.i.d. random variables with E|e_1|^α < ∞ for some α > 0.
Abstract: In this paper we consider the problem of testing for a scale change in the infinite-order moving average process X_j = Σ_{i=0}^∞ a_i e_{j−i}, where the e_j are i.i.d. random variables with E|e_1|^α < ∞ for some α > 0. In performing the test, a cusum of squares test statistic analogous to Inclan & Tiao's (1994) statistic is considered. It is well-known from the literature that outliers affect test procedures leading to false conclusions. In order to remedy this, a cusum of squares test based on trimmed observations is considered. It is demonstrated that this test is robust against outliers and is valid for infinite variance processes as well. Simulation results are given for illustration.
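A compact sketch of an Inclan-Tiao-type cusum of squares statistic and its trimmed variant (drop the largest |x| values before computing); the paper's exact trimming scheme and critical values are not reproduced here.

```python
import numpy as np

def cusum_of_squares(x):
    """Inclan-Tiao-type cusum of squares statistic for a change in scale."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    C = np.cumsum(x ** 2)
    D = C / C[-1] - np.arange(1, n + 1) / n
    return np.sqrt(n / 2.0) * np.max(np.abs(D))

def trimmed_cusum_of_squares(x, trim=0.05):
    """Same statistic computed after discarding the largest |x| values, making the
    test far less sensitive to outliers and heavy tails."""
    x = np.asarray(x, dtype=float)
    cutoff = np.quantile(np.abs(x), 1.0 - trim)
    return cusum_of_squares(x[np.abs(x) <= cutoff])
```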

Proceedings ArticleDOI
26 Sep 2001
TL;DR: A method to perform global registration of local estimates of motion and structure by matching the appearance of feature regions – rather than points – stored over long time periods, using explicit photometric deformation models.
Abstract: Reconstructing three-dimensional structure and motion is often decomposed into two steps: point feature correspondence and three-dimensional reconstruction. This separation often causes gross errors since correspondence relies on the brightness constancy constraint that is local in space and time. Therefore, we advocate the necessity to integrate visual information not only in time (i.e. across different views), but also in space, by matching regions – rather than points – using explicit photometric deformation models. We present an algorithm that integrates 2D region tracking and 3D motion estimation into a closed loop based on an explicit geometric and photometric model, while detecting and rejecting outlier regions that do not fit the model. Our algorithm is recursive and suitable for real-time implementation. Our experiments show that it far exceeds the accuracy and robustness of point feature-based SFM algorithms.

Journal ArticleDOI
TL;DR: A retrospective assessment of exposure to benzene was carried out for a nested case control study of lympho-haematopoietic cancers, including leukaemia, in the Australian petroleum industry, finding that the half-limit-of-detection method was most suitable in this particular study.
Abstract: A retrospective assessment of exposure to benzene was carried out for a nested case control study of lympho-haematopoietic cancers, including leukaemia, in the Australian petroleum industry. Each job or task in the industry was assigned a Base Estimate (BE) of exposure derived from task-based personal exposure assessments carried out by the company occupational hygienists. The BEs corresponded to the estimated arithmetic mean exposure to benzene for each job or task and were used in a deterministic algorithm to estimate the exposure of subjects in the study. Nearly all of the data sets underlying the BEs were found to contain some values below the limit of detection (LOD) of the sampling and analytical methods and some were very heavily censored; up to 95% of the data were below the LOD in some data sets. It was necessary, therefore, to use a method of calculating the arithmetic mean exposures that took into account the censored data. Three different methods were employed in an attempt to select the most appropriate method for the particular data in the study. A common method is to replace the missing (censored) values with half the detection limit. This method has been recommended for data sets where much of the data are below the limit of detection or where the data are highly skewed; with a geometric standard deviation of 3 or more. Another method, involving replacing the censored data with the limit of detection divided by the square root of 2, has been recommended when relatively few data are below the detection limit or where data are not highly skewed. A third method that was examined is Cohen's method. This involves mathematical extrapolation of the left-hand tail of the distribution, based on the distribution of the uncensored data, and calculation of the maximum likelihood estimate of the arithmetic mean. When these three methods were applied to the data in this study it was found that the first two simple methods give similar results in most cases. Cohen's method on the other hand, gave results that were generally, but not always, higher than simpler methods and in some cases gave extremely high and even implausible estimates of the mean. It appears that if the data deviate substantially from a simple log-normal distribution, particularly if high outliers are present, then Cohen's method produces erratic and unreliable estimates. After examining these results, and both the distributions and proportions of censored data, it was decided that the half limit of detection method was most suitable in this particular study.
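The three estimators can be sketched directly: the two substitution rules, plus a left-censored lognormal maximum-likelihood fit standing in for Cohen's method (which uses tabulated MLE corrections). Non-detects are marked as NaN in this illustrative example.

```python
import numpy as np
from scipy import stats, optimize

def mean_lod_substitution(x, lod, divisor=2.0):
    """Replace non-detects (NaN) by LOD/divisor (divisor = 2 or sqrt(2)) and average."""
    return np.where(np.isnan(x), lod / divisor, x).mean()

def mean_censored_lognormal_mle(x, lod):
    """Maximum-likelihood arithmetic mean assuming a lognormal with values below the
    LOD left-censored, in the spirit of Cohen's method."""
    detected = np.log(x[~np.isnan(x)])
    n_cens = int(np.isnan(x).sum())

    def neg_loglik(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        ll = stats.norm.logpdf(detected, mu, sigma).sum()          # detected values
        ll += n_cens * stats.norm.logcdf(np.log(lod), mu, sigma)   # censored contribution
        return -ll

    res = optimize.minimize(neg_loglik, x0=[detected.mean(), np.log(detected.std() + 1e-6)])
    mu, sigma = res.x[0], np.exp(res.x[1])
    return np.exp(mu + sigma ** 2 / 2)          # arithmetic mean of a lognormal

# toy censored sample: NaN marks a measurement below the LOD of 0.5
x = np.array([np.nan, 0.7, 1.2, np.nan, 0.9, 3.4, np.nan, 0.6])
print(mean_lod_substitution(x, 0.5, 2.0),
      mean_lod_substitution(x, 0.5, np.sqrt(2.0)),
      mean_censored_lognormal_mle(x, 0.5))
```

As the abstract notes, the MLE-style approach can become erratic when the data depart from lognormality or contain high outliers, which is why the simple substitution rule was preferred in that study.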

Journal ArticleDOI
TL;DR: Six techniques for finding multivariate outliers are illustrated on a typical laboratory safety data set; the results show that some methods do better than others depending on whether or not the data set is multivariate normal.
Abstract: Summary. During a clinical trial of a new treatment, a large number of variables are measured to monitor the safety of the treatment. It is important to detect outlying observations which may indicate that something abnormal is happening. To do this effectively, techniques are needed for finding multivariate outliers. Six techniques of this sort are described and illustrated on a typical laboratory safety data set. Their properties are investigated more thoroughly by means of a simulation study. The results show that some methods do better than others depending on whether or not the data set is multivariate normal, the dimension of the data set, the type of outlier, the proportion of outliers in a data set and the degree of contamination, i.e. 'outlyingness'. The results indicate that it is desirable to run a battery of multivariate methods on a particular data set in an attempt to highlight possible outliers.
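One classical technique of this sort is a Mahalanobis-distance screen against a chi-square cutoff, which is reasonable when the data are roughly multivariate normal; a minimal sketch (robust variants, e.g. based on the MCD covariance, are usually preferred when several outliers may mask one another).

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.001):
    """Flag observations whose squared Mahalanobis distance exceeds the
    chi-square(1 - alpha) cutoff for the data's dimension."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", X - mu, S_inv, X - mu)   # squared Mahalanobis distances
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])
    return np.flatnonzero(d2 > cutoff), d2
```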

Journal ArticleDOI
TL;DR: The present review is the first in an ongoing guide to medical statistics, using specific examples from intensive care, to describe and summarize the data.
Abstract: The present review is the first in an ongoing guide to medical statistics, using specific examples from intensive care. The first step in any analysis is to describe and summarize the data. As well as becoming familiar with the data, this is also an opportunity to look for unusually high or low values (outliers), to check the assumptions required for statistical tests, and to decide the best way to categorize the data if this is necessary. In addition to tables and graphs, summary values are a convenient way to summarize large amounts of information. This review introduces some of these measures. It describes and gives examples of qualitative data (unordered and ordered) and quantitative data (discrete and continuous); how these types of data can be represented figuratively; the two important features of a quantitative dataset (location and variability); the measures of location (mean, median and mode); the measures of variability (range, interquartile range, standard deviation and variance); common distributions of clinical data; and simple transformations of positively skewed data.

Journal ArticleDOI
TL;DR: This work looks at the quantitative effect of outliers on estimators and test statistics based on normal theory maximum likelihood and the asymptotically distribution-free procedures.
Abstract: A small proportion of outliers can distort the results based on classical procedures in covariance structure analysis. We look at the quantitative effect of outliers on estimators and test statistics based on normal theory maximum likelihood and the asymptotically distribution-free procedures. Even if a proposed structure is correct for the majority of the data in a sample, a small proportion of outliers leads to biased estimators and significant test statistics. An especially unfortunate consequence is that the power to reject a model can be made arbitrarily--but misleadingly--large by inclusion of outliers in an analysis.

Journal ArticleDOI
TL;DR: The present investigation focuses on the nonfuzzy-input, fuzzy-output data type, proposes approaches to handle the outlier problem, and introduces a pre-assigned k-limiting value that must be determined based on the conditions of the current problem.

Patent
John E. Seem1
20 Jul 2001
TL;DR: In this article, an outlier identification method is employed to detect abnormally high or low energy use in a building, where the utility use is measured periodically throughout each day and the measurements are grouped according to days that have similar average utility consumption levels.
Abstract: Outlier identification is employed to detect abnormally high or low energy use in a building. The utility use is measured periodically throughout each day and the measurements are grouped according to days that have similar average utility consumption levels. The data in each group is statistically analyzed using the Generalized Extreme Studentized Deviate (GESD) method. That method identifies outliers which are data samples that vary significantly from the majority of the data. The degree to which each outlier deviates from the remainder of the data indicates the severity of the abnormal utility consumption denoted by that outlier. The resultant outlier information is readily discernable by the building operators in assessing whether the cause of a particular occurrence of abnormal utility usage requires further investigation.
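A self-contained sketch of the GESD test as commonly described (Rosner's formulation); the patent's grouping by day type and its severity reporting are not reproduced.

```python
import numpy as np
from scipy import stats

def gesd_outliers(x, max_outliers=10, alpha=0.05):
    """Generalized Extreme Studentized Deviate test: returns indices of flagged outliers."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    removed, R, lam = [], [], []
    work, work_idx = x.copy(), np.arange(n)
    for i in range(1, max_outliers + 1):
        z = np.abs(work - work.mean()) / work.std(ddof=1)
        j = int(np.argmax(z))
        R.append(z[j])                          # i-th extreme studentized deviate
        removed.append(int(work_idx[j]))
        work = np.delete(work, j)
        work_idx = np.delete(work_idx, j)
        # critical value lambda_i from the t-distribution
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, df=n - i - 1)
        lam.append((n - i) * t / np.sqrt((n - i - 1 + t ** 2) * (n - i + 1)))
    # number of outliers = largest i with R_i > lambda_i
    n_out = 0
    for i in range(max_outliers):
        if R[i] > lam[i]:
            n_out = i + 1
    return removed[:n_out]

# daily energy readings with a few abnormal days mixed in
x = np.concatenate([np.random.default_rng(0).normal(20, 2, 96), [45, 3, 41, 44]])
print(gesd_outliers(x))
```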

BookDOI
31 Jan 2001
TL;DR: A book-length treatment of robust statistics covering the Huber minimax and Hampel approaches, probability-free optimization criteria in data analysis, equivariant contrast functions, and robust minimax estimation of location and scale.
Abstract: Contents: general remarks; the Huber minimax approach; the Hampel approach; optimization criteria in data analysis (a probability-free approach): translation and scale equivariant contrast functions, orthogonal equivariant contrast functions, monotonically equivariant contrast functions, minimal sensitivity to small perturbations in the data, affine equivariant contrast functions; robust minimax estimation of location: estimation in models with bounded variances, estimation in models with bounded subranges, robust estimators of multivariate location, least informative lattice distributions; robust estimation of scale: measures of scale defined by functionals, M-, L-, and R-estimators of scale, the Huber minimax estimator of scale; robust regression and autoregression: minimax variance regression, robust autoregression, robust identification in dynamic models; robustness of L1-norm estimators: stability of L1-approximations, robustness of the L1-regression; robust estimation of correlation: Monte Carlo and asymptotic analysis, synthesis, minimax variance correlation, two-stage estimators (rejection of outliers plus classics); computation and data analysis technologies: adaptive robust procedures, smoothing quantile functions by the Bernstein polynomials, robust bivariate boxplots; applications: robust elimination in the statistical theory of reliability, robust detection of signals based on optimisation criteria, statistical analysis of sudden cardiac death risk factors.

Journal ArticleDOI
TL;DR: A fast-food restaurant franchise is used as a case to illustrate how data mining can be applied to time series data and help the franchise reap the benefits of such an effort.

Journal ArticleDOI
TL;DR: Several published techniques for detecting multiple outliers in linear regression are evaluated in an extensive Monte Carlo simulation, assessing the impact of outlier density and geometry, regressor variable dimension, and outlying distance on detection capability and false alarm (swamping) probability.

Book ChapterDOI
01 Jan 2001
TL;DR: This work takes the Bayesian pooling approach to drawing information from analogous time series to model and forecast a given time series, and combines estimated parameters of the group model with conventional time-series-model parameters, using so-called weights shrinkage.
Abstract: Organizations that use time-series forecasting regularly, generally use it for many products or services. Among the variables they forecast are groups of analogous time series (series that follow similar, time-based patterns). Their covariation is a largely untapped source of information that can improve forecast accuracy. We take the Bayesian pooling approach to drawing information from analogous time series to model and forecast a given time series. In using Bayesian pooling, we use data from analogous time series as multiple observations per time period in a group-level model. We then combine estimated parameters of the group model with conventional time-series-model parameters, using so-called weights shrinkage. Major benefits of this approach are that it (1) requires few parameters for estimation; (2) builds directly on conventional time-series models; (3) adapts to pattern changes in time series, providing rapid adjustments and accurate model estimates; and (4) screens out adverse effects of outlier data points on time-series model estimates. For practitioners, we provide the terms, concepts, and methods necessary for a basic understanding of Bayesian pooling and the conditions under which it improves upon conventional time-series methods. For researchers, we describe the experimental data, treatments, and factors needed to compare the forecast accuracy of pooling methods. Last, we present basic principles for applying pooling methods and supporting empirical results. Conditions favoring pooling include time series with high volatility and outliers. Simple pooling methods are more accurate than complex methods, and we recommend manual intervention for cases with few time series.
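The combination step itself is a weighted average of group-level and individual parameter estimates; a one-line sketch (the pooling weights in the chapter come from the Bayesian group model, not fixed by hand as here).

```python
import numpy as np

def shrink(theta_individual, theta_group, w):
    """Weight-shrinkage combination of an individual series' parameter estimate toward
    the group (pooled) estimate; w in [0, 1] is the shrinkage weight."""
    return w * np.asarray(theta_group) + (1 - w) * np.asarray(theta_individual)

# toy example: level/trend estimates from one noisy series vs. the analogous group
print(shrink(theta_individual=[2.3, 0.9], theta_group=[1.5, 0.4], w=0.6))
```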

Proceedings ArticleDOI
11 Nov 2001
TL;DR: An outlier finder for text is implemented, which can detect both unusual matches and unusual mismatches to a text pattern, which substantially reduced errors when integrated into the user interface of a PBD text editor and tested in a user study.
Abstract: When users handle large amounts of data, errors are hard to notice. Outlier finding is a new way to reduce errors by directing the user's attention to inconsistent data which may indicate errors. We have implemented an outlier finder for text, which can detect both unusual matches and unusual mismatches to a text pattern. When integrated into the user interface of a PBD text editor and tested in a user study, outlier finding substantially reduced errors.

Journal ArticleDOI
TL;DR: This paper proposes a parametric test for symmetry based on the Pearson type IV family of distributions, which takes account of leptokurtosis explicitly, and shows that the test performs quite well in finite samples and is robust to excess kurtosis.
Abstract: Most of the tests for asymmetry are developed under the null hypothesis of a normal distribution. As is well known, much financial data exhibits fat tails, and commonly used tests (such as the standard square root test based on sample skewness) are not valid for leptokurtic financial data. Also, the square root test uses the third moment, which may not be robust in the presence of gross outliers. In this paper, we propose a simple parametric test for symmetry based on the Pearson type IV family of distributions, which takes account of leptokurtosis explicitly. Our test is based on a function that is bounded over the real line, and we expect it to be better behaved than the test based on sample skewness (third moment). Results from our Monte Carlo study reveal that the suggested test performs quite well in finite samples, and it is robust to excess kurtosis. We also apply the test to stock return data to illustrate its usefulness.