
Showing papers on "Outlier published in 2001"


Proceedings ArticleDOI
Charu C. Aggarwal1, Philip S. Yu1
01 May 2001
TL;DR: New techniques for outlier detection which find the outliers by studying the behavior of projections from the data set are discussed.
Abstract: The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set.
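The projections the paper studies are low-dimensional grid cells that are far emptier than independence would predict. Below is a minimal sketch of that idea, assuming equi-depth ranges per dimension and a brute-force scan over k-dimensional projections (the paper itself uses an evolutionary search); the function names, grid resolution, and threshold are illustrative only.

```python
import numpy as np
from itertools import combinations

def sparsity_coefficient(count, N, f, k):
    """Standardized deviation of a k-dimensional cell count from its expectation
    under independence; strongly negative values flag abnormally sparse projections."""
    expected = N * f ** k
    return (count - expected) / np.sqrt(N * f ** k * (1 - f ** k))

def sparse_cells(X, phi=5, k=2, threshold=-3.0):
    """Scan all k-dimensional axis-parallel projections of X, grid each dimension
    into equi-depth ranges, and report occupied cells with low sparsity coefficients."""
    N, d = X.shape
    f = 1.0 / phi
    bins = np.empty_like(X, dtype=int)
    for j in range(d):
        # interior quantiles define equi-depth range boundaries for dimension j
        edges = np.quantile(X[:, j], np.linspace(0, 1, phi + 1)[1:-1])
        bins[:, j] = np.searchsorted(edges, X[:, j])
    flagged = []
    for dims in combinations(range(d), k):
        cols = list(dims)
        cells, counts = np.unique(bins[:, cols], axis=0, return_counts=True)
        for cell, c in zip(cells, counts):
            s = sparsity_coefficient(c, N, f, k)
            if s <= threshold:
                flagged.append((dims, tuple(cell), int(c), float(s)))
    return flagged
```

Points falling in the flagged cells are the candidate outliers; proximity to other points is never used, which is the point of the projection-based definition.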

1,132 citations


Proceedings ArticleDOI
07 Jul 2001
TL;DR: The theory of Robust Principal Component Analysis is developed and a robust M-estimation algorithm is described for learning linear multi-variate representations of high dimensional data such as images, which illustrates the benefits of RPCA when outliers are present.
Abstract: Principal Component Analysis (PCA) has been widely used for the representation of shape, appearance and motion. One drawback of typical PCA methods is that they are least squares estimation techniques and hence fail to account for "outliers" which are common in realistic training sets. In computer vision applications, outliers typically occur within a sample (image) due to pixels that are corrupted by noise, alignment errors, or occlusion. We review previous approaches for making PCA robust to outliers and present a new method that uses an intra-sample outlier process to account for pixel outliers. We develop the theory of Robust Principal Component Analysis (RPCA) and describe a robust M-estimation algorithm for learning linear multi-variate representations of high dimensional data such as images. Quantitative comparisons with traditional PCA and previous robust algorithms illustrate the benefits of RPCA when outliers are present. Details of the algorithm are described and a software implementation is being made publicly available.
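A toy illustration of the M-estimation idea, not the paper's algorithm: PCA is re-fit with per-pixel weights derived from a robust (Geman-McClure-style) function so that corrupted pixels stop dominating the least-squares fit. All names and constants here are illustrative.

```python
import numpy as np

def robust_pca(X, n_components=5, n_iter=20, c=2.0):
    """Toy iteratively-reweighted PCA: per-entry weights downweight pixel outliers.
    X has one sample (e.g. a flattened image) per row."""
    W = np.ones_like(X)                          # per-pixel weights, start uniform
    for _ in range(n_iter):
        mu = (W * X).sum(0) / W.sum(0)           # weighted mean image
        Xc = X - mu
        # weighted covariance and its leading eigenvectors (the robust basis)
        C = (W * Xc).T @ (W * Xc) / len(X)
        _, vecs = np.linalg.eigh(C)
        U = vecs[:, -n_components:]
        R = Xc - (Xc @ U) @ U.T                  # per-pixel reconstruction residuals
        sigma = 1.4826 * np.median(np.abs(R)) + 1e-12
        W = 1.0 / (1.0 + (R / (c * sigma)) ** 2) ** 2   # Geman-McClure-style weights
    return mu, U, W
```

Pixels with persistently large residuals end up with weights near zero, which is the effect the intra-sample outlier process is designed to achieve.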

378 citations


Proceedings ArticleDOI
26 Aug 2001
TL;DR: A novel method to efficiently find the top-n local outliers in large databases is proposed, using the concept of "micro-cluster" to compress the data.
Abstract: Outlier detection is an important task in data mining with numerous applications, including credit card fraud detection, video surveillance, etc. A recent work on outlier detection has introduced a novel notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned a Local Outlier Factor (LOF) which represents the likelihood of that object being an outlier. Although the concept of local outliers is a useful one, the computation of LOF values for every data object requires a large number of k-nearest neighbors searches and can be computationally expensive. Since most objects are usually not outliers, it is useful to provide users with the option of finding only the n most outstanding local outliers, i.e., the top-n data objects which are most likely to be local outliers according to their LOFs. However, if the pruning is not done carefully, finding top-n outliers could result in the same amount of computation as finding LOF for all objects. In this paper, we propose a novel method to efficiently find the top-n local outliers in large databases. The concept of "micro-cluster" is introduced to compress the data. An efficient micro-cluster-based local outlier mining algorithm is designed based on this concept. As our algorithm can be adversely affected by overlapping micro-clusters, we propose a meaningful cut-plane solution for overlapping data. The formal analysis and experiments show that this method can achieve good performance in finding the most outstanding local outliers.
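For orientation, a plain LOF computation followed by a top-n sort (scikit-learn's LocalOutlierFactor) shows the quantity the paper accelerates; the micro-cluster compression and pruning are the paper's contribution and are not reproduced here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),     # bulk of the data
               rng.uniform(-8, 8, size=(10, 2))])    # a few scattered points

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)                                    # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_                # larger = more outlying

top_n = 10
top_idx = np.argsort(scores)[-top_n:][::-1]           # indices of the top-n local outliers
print(top_idx, scores[top_idx])
```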

356 citations


Journal ArticleDOI
TL;DR: The authors proposed a hierarchical linear mixed-effects model for orthodontic data, in which the random effects and the within-subject errors have multivariate t-distributions with known or unknown degrees of freedom.
Abstract: Linear mixed-effects models are frequently used to analyze repeated measures data, because they model flexibly the within-subject correlation often present in this type of data. The most popular linear mixed-effects model for a continuous response assumes normal distributions for the random effects and the within-subject errors, making it sensitive to outliers. Such outliers are more problematic for mixed-effects models than for fixed-effects models, because they may occur in the random effects, in the within-subject errors, or in both, making them harder to detect in practice. Motivated by a real dataset from an orthodontic study, we propose a robust hierarchical linear mixed-effects model in which the random effects and the within-subject errors have multivariate t-distributions, with known or unknown degrees-of-freedom, which are allowed to vary with groups of subjects. By using a gamma-normal hierarchical structure, our model allows the identification and classification of both types of outliers,...
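The gamma-normal hierarchy can be illustrated directly: scaling a normal draw by a Gamma(ν/2, ν/2) precision weight yields a t-distributed error, and observations receiving small weights behave as outliers that the model automatically downweights. A minimal simulation sketch, not the paper's estimation procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
nu = 4                                     # degrees of freedom
n = 10_000

# gamma-normal hierarchy: tau_i ~ Gamma(nu/2, rate=nu/2), e_i | tau_i ~ N(0, 1/tau_i)
tau = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)
e = rng.normal(0, 1, size=n) / np.sqrt(tau)

# e is marginally t-distributed with nu degrees of freedom; observations with
# small tau (heavy downweighting) are the ones that act as outliers
print(np.var(e), nu / (nu - 2))            # sample variance vs. theoretical nu/(nu-2)
```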

301 citations


Journal ArticleDOI
01 Aug 2001-Genetics
TL;DR: It is found that outlier loci are easier to recognize when this joint distribution is conditioned on the total number of allelic states represented in the pooled sample at each locus, and the conditional distribution is less sensitive to the values of nuisance parameters.
Abstract: Population structure and history have similar effects on the genetic diversity at all neutral loci. However, some marker loci may also have been strongly influenced by natural selection. Selection shapes genetic diversity in a locus-specific manner. If we could identify those loci that have responded to selection during the divergence of populations, then we may obtain better estimates of the parameters of population history by excluding these loci. Previous attempts were made to identify outlier loci from the distribution of sample statistics under neutral models of population structure and history. Unfortunately these methods depend on assumptions about population structure and history that usually cannot be verified. In this article, we define new population-specific parameters of population divergence and construct sample statistics that are estimators of these parameters. We then use the joint distribution of these estimators to identify outlier loci that may be subject to selection. We found that outlier loci are easier to recognize when this joint distribution is conditioned on the total number of allelic states represented in the pooled sample at each locus. This is so because the conditional distribution is less sensitive to the values of nuisance parameters.

285 citations


Journal ArticleDOI
TL;DR: Combining traditional and robust statistical techniques provides a good method of identifying outliers in a reference interval setting, even in healthy samples, although the size of the effect varies considerably among analytes.
Abstract: Background: Improvement in reference interval estimation using a new outlier detection technique, even with a physician-determined healthy sample, is examined. The effect of including physician-determined nonhealthy individuals in the sample is evaluated. Methods: Traditional data transformation coupled with robust and exploratory outlier detection methodology were used in conjunction with various reference interval determination techniques. A simulation study was used to examine the effects of outliers on known reference intervals. Physician-defined healthy groups with and without nonhealthy individuals were compared on real data. Results: With 5% outliers in simulated samples, the described outlier detection techniques had narrower reference intervals. Application of the technique to real data provided reference intervals that were, on average, 10% narrower than those obtained when outlier detection was not used. Only 1.6% of the samples were identified as outliers and removed from reference interval determination in both the healthy and combined samples. Conclusions: Even in healthy samples, outliers may exist. Combining traditional and robust statistical techniques provides a good method of identifying outliers in a reference interval setting. Laboratories in general do not have a well-defined healthy group from which to compute reference intervals. The effect of nonhealthy individuals in the computation increases reference interval width by ∼10%. However, there is a large deviation among analytes.
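A rough sketch of the transform-then-trim idea: Box-Cox transform, Tukey-type fences on the transformed values, then the central 95% of the retained data. The paper's exact outlier rule and reference-interval estimators differ, and the constants below are illustrative.

```python
import numpy as np
from scipy import stats

def reference_interval(x, k=1.5):
    """Toy transform-then-trim reference interval: Box-Cox transform, flag points
    outside Tukey fences, then take the central 95% of the retained values."""
    x = np.asarray(x, dtype=float)
    z, _ = stats.boxcox(x)                        # requires strictly positive data
    q1, q3 = np.percentile(z, [25, 75])
    iqr = q3 - q1
    keep = (z >= q1 - k * iqr) & (z <= q3 + k * iqr)
    lo, hi = np.percentile(x[keep], [2.5, 97.5])
    return lo, hi, np.flatnonzero(~keep)           # interval plus indices of flagged outliers
```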

263 citations


Book ChapterDOI
01 Jan 2001
TL;DR: The Huber estimator, as discussed by the authors, is the best robust estimator in the sense that its maximal variance over all contaminated distributions is as small as possible; more generally, estimators with bounded influence functions are desirable.
Abstract: Results of a data analysis can only be convincing if they are based on sound methods. Robustness is a crucial attribute of the quality of a data analytic method. Among the various indicators of robustness, the sensitivity and influence functions are the most basic. They both assess the influence exerted by an additional observation, taking an arbitrary value, on the result of the data analysis. Large values of the influence function of an estimator indicate weaknesses, whereas estimators with bounded influence functions are desirable. The breakdown point of a statistical method is the minimal fraction of the observations which, when manipulated, can totally destroy the meaning of the results. This indicator is of a global nature, because it assesses the effects of contaminations of sizable fractions of the cases in a data set rather than a single case. Contaminated distributions such as (1−e)Φ(y)+eΦ(y/σ) are useful in modeling outliers. With probability (1−e) this mixture distribution yields an observation from the standard normal distribution Φ, whereas with probability e the observation comes from the contaminating distribution Φ(y/σ), which has an inflated variance σ² (σ > 1). The Huber estimator is the best robust estimator in the sense that its maximal variance over all contaminated distributions is as small as possible.
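A small sketch tying the two pieces together: a sample drawn from the contaminated distribution (1−e)Φ(y)+eΦ(y/σ) and a Huber M-estimate of location computed by iteratively reweighted means. The tuning constant and scale estimate are conventional choices, not taken from this chapter.

```python
import numpy as np

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location: observations farther than k*scale from the
    current estimate are downweighted, bounding the influence of outliers."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    s = 1.4826 * np.median(np.abs(x - mu))             # robust scale (MAD)
    for _ in range(max_iter):
        r = (x - mu) / s
        w = np.clip(k / np.maximum(np.abs(r), 1e-12), None, 1.0)   # Huber weights min(1, k/|r|)
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

# contaminated sample: (1 - e) N(0,1) + e N(0, sigma^2) with sigma = 5
rng = np.random.default_rng(0)
e, sigma = 0.1, 5.0
contam = rng.random(1000) < e
y = np.where(contam, rng.normal(0, sigma, 1000), rng.normal(0, 1, 1000))
print(np.mean(y), huber_location(y))   # the M-estimate is far less affected by the contamination
```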

249 citations


Book ChapterDOI
02 Jul 2001
TL;DR: This paper investigates if and how one-class classifiers can be combined best in a handwritten digit recognition problem and shows how this can increase the robustness of the classification.
Abstract: In the problem of one-class classification target objects should be distinguished from outlier objects. In this problem it is assumed that only information of the target class is available while nothing is known about the outlier class. Like standard two-class classifiers, one-class classifiers hardly ever fit the data distribution perfectly. Using only the best classifier and discarding the classifiers with poorer performance might waste valuable information. To improve performance the results of different classifiers (which may differ in complexity or training algorithm) can be combined. This can not only increase the performance but it can also increase the robustness of the classification. Because for one-class classifiers only information of one of the classes is present, combining one-class classifiers is more difficult. In this paper we investigate if and how one-class classifiers can be combined best in a handwritten digit recognition problem.
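A generic illustration of the combining step, assuming scikit-learn one-class models as stand-ins for the classifiers studied in the paper: each model is trained on target data only, and their standardized scores are averaged (the mean rule). The data, models, and normalization are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_target = rng.normal(0, 1, size=(300, 16))             # only target-class data is available
X_test = np.vstack([rng.normal(0, 1, size=(20, 16)),     # target-like test points
                    rng.uniform(-6, 6, size=(20, 16))])  # outliers

# a small pool of one-class models differing in complexity
models = [OneClassSVM(kernel="rbf", gamma=g, nu=0.1).fit(X_target)
          for g in (0.01, 0.1)] + [IsolationForest(random_state=0).fit(X_target)]

# combine by averaging standardized decision scores (mean rule)
scores = []
for m in models:
    s = m.decision_function(X_test)
    scores.append((s - s.mean()) / (s.std() + 1e-12))
combined = np.mean(scores, axis=0)      # higher = more target-like, lower = more outlier-like
print(combined[:5], combined[-5:])
```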

188 citations


Proceedings ArticleDOI
26 Aug 2001
TL;DR: This paper defines statistical tests, analyzes the statistical foundation underlying the approach, design several fast algorithms to detect spatial outliers, and provides a cost model for outlier detection procedures.
Abstract: Identification of outliers can lead to the discovery of unexpected, interesting, and useful knowledge. Existing methods are designed for detecting spatial outliers in multidimensional geometric data sets, where a distance metric is available. In this paper, we focus on detecting spatial outliers in graph structured data sets. We define statistical tests, analyze the statistical foundation underlying our approach, design several fast algorithms to detect spatial outliers, and provide a cost model for outlier detection procedures. In addition, we provide experimental results from the application of our algorithms on a Minneapolis-St. Paul (Twin Cities) traffic dataset to show their effectiveness and usefulness.
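One simple test in this spirit (not necessarily the paper's exact statistic) standardizes the difference between a node's attribute value and the average over its graph neighbors; nodes with large standardized differences are spatial outliers. A minimal sketch with a hypothetical sensor graph:

```python
import numpy as np

def spatial_outliers(values, neighbors, threshold=2.0):
    """Flag nodes whose attribute differs sharply from the average of their graph
    neighbors. `values` maps node -> attribute, `neighbors` maps node -> adjacent nodes."""
    diffs = {}
    for node, v in values.items():
        nbr_vals = [values[n] for n in neighbors[node]]
        diffs[node] = v - np.mean(nbr_vals)              # neighborhood difference
    d = np.array(list(diffs.values()))
    z = (d - d.mean()) / d.std()                          # standardize over all nodes
    return [node for node, zi in zip(diffs, z) if abs(zi) > threshold]

# toy traffic-like example: sensor 'c' reports a very different value from its neighbors
values = {"a": 50, "b": 52, "c": 5, "d": 49, "e": 51}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(spatial_outliers(values, neighbors))
```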

185 citations


Proceedings ArticleDOI
02 Apr 2001
TL;DR: It is demonstrated that a combination of outlier indexing with weighted sampling can be used to answer aggregation queries with a significantly reduced approximation error compared to either uniform sampling or weighted sampling alone.
Abstract: We study the problem of approximately answering aggregation queries using sampling. We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is skewed. To address this issue, we introduce a technique called outlier indexing. Uniform sampling is also ineffective for queries with low selectivity. We rely on weighted sampling based on workload information to overcome this shortcoming. We demonstrate that a combination of outlier indexing with weighted sampling can be used to answer aggregation queries with a significantly reduced approximation error compared to either uniform sampling or weighted sampling alone. We discuss the implementation of these techniques on Microsoft's SQL Server and present experimental results that demonstrate the merits of our techniques.
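A back-of-the-envelope sketch of the core idea for SUM: aggregate a small set of indexed outliers exactly and scale up a uniform sample of the rest. The cutoff rule and fractions are illustrative; the paper's system additionally uses workload-based weighted sampling inside SQL Server.

```python
import numpy as np

def approx_sum(values, sample_frac=0.01, outlier_frac=0.001, seed=0):
    """Estimate SUM(values) by combining an 'outlier index' (the most extreme values,
    aggregated exactly) with a uniform sample of the remaining rows."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    k = max(1, int(outlier_frac * len(values)))
    # index the k values farthest from the median as outliers
    order = np.argsort(np.abs(values - np.median(values)))
    outliers, rest = values[order[-k:]], values[order[:-k]]
    exact_part = outliers.sum()
    sample = rng.choice(rest, size=max(1, int(sample_frac * len(rest))), replace=False)
    estimated_part = sample.sum() * len(rest) / len(sample)   # scale sample up to the full table
    return exact_part + estimated_part

rng = np.random.default_rng(1)
data = np.concatenate([rng.exponential(1.0, 100_000), rng.exponential(500.0, 100)])  # skewed
print(data.sum(), approx_sum(data))
```

With the heavy tail handled exactly, the sampling error only has to cover the well-behaved bulk of the distribution.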

177 citations


Proceedings ArticleDOI
07 Jul 2001
TL;DR: An efficient algorithm to track point features supported by image patches undergoing affine deformations and changes in illumination is developed based on a combined model of geometry and photometry that is used to track features as well as to detect outliers in a hypothesis testing framework.
Abstract: We develop an efficient algorithm to track point features supported by image patches undergoing affine deformations and changes in illumination. The algorithm is based on a combined model of geometry and photometry that is used to track features as well as to detect outliers in a hypothesis testing framework. The algorithm runs in real time on a personal computer and is publicly available.

Journal ArticleDOI
TL;DR: A Monte Carlo study shows that this diagnostic procedure for detecting additive and innovation outliers as well as level shifts in a regression model with ARIMA errors is more powerful than the classical methods based on maximum likelihood type estimates and Kalman filtering.
Abstract: A diagnostic procedure for detecting additive and innovation outliers as well as level shifts in a regression model with ARIMA errors is introduced. The procedure is based on a robust estimate of the model parameters and on innovation residuals computed by means of robust filtering. A Monte Carlo study shows that, when there is a large proportion of outliers, this procedure is more powerful than the classical methods based on maximum likelihood type estimates and Kalman filtering. Copyright © 2001 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: The proposed PP method projects a high dimensional data set into a low dimensional data space while retaining desired information of interest, utilizing a projection index to explore projections of interestingness.
Abstract: The authors present a projection pursuit (PP) approach to target detection. Unlike most existing target detection algorithms, which require statistical models such as the linear mixture model, the proposed PP approach projects a high dimensional data set into a low dimensional data space while retaining the desired information of interest. It utilizes a projection index to explore projections of interestingness. For target detection applications in hyperspectral imagery, an interesting structure of an image scene is one caused by man-made targets in a large unknown background. Such targets can be viewed as anomalies in an image scene because their size is relatively small compared to their background surroundings. As a result, detecting small targets in an unknown image scene reduces to finding the outliers of background distributions. Skewness, defined as the normalized third moment of the sample distribution, measures the asymmetry of the distribution, while kurtosis, defined as the normalized fourth moment, measures its flatness. Both are susceptible to outliers, so a projection index based on skewness and kurtosis may be effective for target detection. In order to find an optimal projection index, an evolutionary algorithm is also developed to avoid being trapped in local optima. The hyperspectral image experiments show that the proposed PP method provides an effective means for target detection.
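A minimal sketch of a skewness/kurtosis-based projection index with a crude random search over directions (the paper uses an evolutionary algorithm for this optimization); all parameters are illustrative.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def pp_index(y):
    """Projection index favoring asymmetric / heavy-tailed (outlier-revealing) 1-D projections."""
    return skew(y) ** 2 + kurtosis(y) ** 2

def best_projection(X, n_candidates=2000, seed=0):
    """Crude random search over unit projection directions."""
    rng = np.random.default_rng(seed)
    best_w, best_val = None, -np.inf
    for _ in range(n_candidates):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)                  # random unit direction
        val = pp_index(X @ w)
        if val > best_val:
            best_w, best_val = w, val
    return best_w, best_val
```

The returned direction is the one along which the projected data look most non-Gaussian, which is where small anomalous targets tend to reveal themselves.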

Journal ArticleDOI
TL;DR: The proposed technique requires the computation of a constant matrix which encodes the point correspondence information, followed by an efficient iterative algorithm to compute the optimal rotations; the remaining unknowns are then recovered directly through the solution of a linear equation system.

Journal ArticleDOI
TL;DR: In this article, the three-part redescending estimator of Hampel is compared with the Huber estimator and the Fair function for data reconciliation and parameter estimation.

Proceedings ArticleDOI
Kenji Yamanishi1, Jun'ichi Takeuchi1
26 Aug 2001
TL;DR: Applying this framework to network intrusion detection, it is demonstrated that it can significantly improve the accuracy of SmartSifter and that outlier filtering rules can help the user discover a general pattern of an outlier group.
Abstract: This paper is concerned with the problem of detecting outliers from unlabeled data. In prior work we have developed SmartSifter, which is an on-line outlier detection algorithm based on unsupervised learning from data. On the basis of SmartSifter this paper yields a new framework for outlier filtering using both supervised and unsupervised learning techniques iteratively in order to make the detection process more effective and more understandable. The outline of the framework is as follows: In the first round, for an initial dataset, we run SmartSifter to give each data point a score, with a high score indicating a high possibility of being an outlier. Next, giving positive labels to a number of higher scored data points and negative labels to a number of lower scored data points, we create labeled examples. Then we construct an outlier filtering rule by supervised learning from them. Here the rule is generated based on the principle of minimizing extended stochastic complexity. In the second round, for a new dataset, we filter the data using the constructed rule, then among the filtered data, we run SmartSifter again to evaluate the data in order to update the filtering rule. Applying our framework to network intrusion detection, we demonstrate that (1) it can significantly improve the accuracy of SmartSifter, and (2) outlier filtering rules can help the user to discover a general pattern of an outlier group.
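A schematic of the two-round loop, with stand-ins clearly substituted: a Gaussian-mixture score replaces SmartSifter and a decision tree replaces the extended-stochastic-complexity rule learner. Only the structure of the framework is illustrated.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeClassifier

# Round 1: unsupervised scoring, then label the extremes and learn a filtering rule from them.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (1000, 4)), rng.normal(6, 1, (15, 4))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
scores = -gm.score_samples(X)                   # higher = more outlying (negative log-likelihood)
order = np.argsort(scores)
pos, neg = order[-30:], order[:100]             # high-scored -> positive labels, low-scored -> negative
rule = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    np.vstack([X[pos], X[neg]]),
    np.r_[np.ones(len(pos)), np.zeros(len(neg))])

# Round 2: filter a new dataset with the rule; the retained points would be re-scored
# and used to update the rule.
X_new = np.vstack([rng.normal(0, 1, (500, 4)), rng.normal(6, 1, (10, 4))])
filtered = X_new[rule.predict(X_new) == 1]
print(len(filtered), "points passed the filter for re-scoring")
```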

Journal ArticleDOI
TL;DR: In this article, the authors considered the problem of testing for a scale change in the infinite-order moving average process X_j = Σ_{i=0}^∞ a_i e_{j−i}, where the e_j are i.i.d. random variables with E|e_1|^α < ∞ for some α > 0.
Abstract: In this paper we consider the problem of testing for a scale change in the infinite-order moving average process X_j = Σ_{i=0}^∞ a_i e_{j−i}, where the e_j are i.i.d. random variables with E|e_1|^α < ∞ for some α > 0. In performing the test, a cusum of squares test statistic analogous to Inclan & Tiao's (1994) statistic is considered. It is well-known from the literature that outliers affect test procedures leading to false conclusions. In order to remedy this, a cusum of squares test based on trimmed observations is considered. It is demonstrated that this test is robust against outliers and is valid for infinite variance processes as well. Simulation results are given for illustration.
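A compact sketch of an Inclan-Tiao-type cusum of squares statistic and its trimmed variant (drop the largest |x| values before computing); the paper's exact trimming scheme and critical values are not reproduced here.

```python
import numpy as np

def cusum_of_squares(x):
    """Inclan-Tiao-type cusum of squares statistic for a change in scale."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    C = np.cumsum(x ** 2)
    D = C / C[-1] - np.arange(1, n + 1) / n
    return np.sqrt(n / 2.0) * np.max(np.abs(D))

def trimmed_cusum_of_squares(x, trim=0.05):
    """Same statistic computed after discarding the largest |x| values, making the
    test far less sensitive to outliers and heavy tails."""
    x = np.asarray(x, dtype=float)
    cutoff = np.quantile(np.abs(x), 1.0 - trim)
    return cusum_of_squares(x[np.abs(x) <= cutoff])
```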

Proceedings ArticleDOI
26 Sep 2001
TL;DR: A method to perform global registration of local estimates of motion and structure by matching the appearance of feature regions – rather than points – stored over long time periods, using explicit photometric deformation models.
Abstract: Reconstructing three-dimensional structure and motion is often decomposed into two steps: point feature correspondence and three-dimensional reconstruction. This separation often causes gross errors since correspondence relies on the brightness constancy constraint that is local in space and time. Therefore, we advocate the necessity to integrate visual information not only in time (i.e. across different views), but also in space, by matching regions – rather than points – using explicit photometric deformation models. We present an algorithm that integrates 2D region tracking and 3D motion estimation into a closed loop based on an explicit geometric and photometric model, while detecting and rejecting outlier regions that do not fit the model. Our algorithm is recursive and suitable for real-time implementation. Our experiments show that it far exceeds the accuracy and robustness of point feature-based SFM algorithms.

Journal ArticleDOI
TL;DR: A retrospective assessment of exposure to benzene was carried out for a nested case control study of lympho-haematopoietic cancers, including leukaemia, in the Australian petroleum industry, finding that the half-limit-of-detection method was most suitable in this particular study.
Abstract: A retrospective assessment of exposure to benzene was carried out for a nested case control study of lympho-haematopoietic cancers, including leukaemia, in the Australian petroleum industry. Each job or task in the industry was assigned a Base Estimate (BE) of exposure derived from task-based personal exposure assessments carried out by the company occupational hygienists. The BEs corresponded to the estimated arithmetic mean exposure to benzene for each job or task and were used in a deterministic algorithm to estimate the exposure of subjects in the study. Nearly all of the data sets underlying the BEs were found to contain some values below the limit of detection (LOD) of the sampling and analytical methods and some were very heavily censored; up to 95% of the data were below the LOD in some data sets. It was necessary, therefore, to use a method of calculating the arithmetic mean exposures that took into account the censored data. Three different methods were employed in an attempt to select the most appropriate method for the particular data in the study. A common method is to replace the missing (censored) values with half the detection limit. This method has been recommended for data sets where much of the data are below the limit of detection or where the data are highly skewed; with a geometric standard deviation of 3 or more. Another method, involving replacing the censored data with the limit of detection divided by the square root of 2, has been recommended when relatively few data are below the detection limit or where data are not highly skewed. A third method that was examined is Cohen's method. This involves mathematical extrapolation of the left-hand tail of the distribution, based on the distribution of the uncensored data, and calculation of the maximum likelihood estimate of the arithmetic mean. When these three methods were applied to the data in this study it was found that the first two simple methods give similar results in most cases. Cohen's method on the other hand, gave results that were generally, but not always, higher than simpler methods and in some cases gave extremely high and even implausible estimates of the mean. It appears that if the data deviate substantially from a simple log-normal distribution, particularly if high outliers are present, then Cohen's method produces erratic and unreliable estimates. After examining these results, and both the distributions and proportions of censored data, it was decided that the half limit of detection method was most suitable in this particular study.
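The three estimators can be sketched directly: the two substitution rules, plus a left-censored lognormal maximum-likelihood fit standing in for Cohen's method (which uses tabulated MLE corrections). Non-detects are marked as NaN in this illustrative example.

```python
import numpy as np
from scipy import stats, optimize

def mean_lod_substitution(x, lod, divisor=2.0):
    """Replace non-detects (NaN) by LOD/divisor (divisor = 2 or sqrt(2)) and average."""
    return np.where(np.isnan(x), lod / divisor, x).mean()

def mean_censored_lognormal_mle(x, lod):
    """Maximum-likelihood arithmetic mean assuming a lognormal with values below the
    LOD left-censored, in the spirit of Cohen's method."""
    detected = np.log(x[~np.isnan(x)])
    n_cens = int(np.isnan(x).sum())

    def neg_loglik(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        ll = stats.norm.logpdf(detected, mu, sigma).sum()          # detected values
        ll += n_cens * stats.norm.logcdf(np.log(lod), mu, sigma)   # censored contribution
        return -ll

    res = optimize.minimize(neg_loglik, x0=[detected.mean(), np.log(detected.std() + 1e-6)])
    mu, sigma = res.x[0], np.exp(res.x[1])
    return np.exp(mu + sigma ** 2 / 2)          # arithmetic mean of a lognormal

# toy censored sample: NaN marks a measurement below the LOD of 0.5
x = np.array([np.nan, 0.7, 1.2, np.nan, 0.9, 3.4, np.nan, 0.6])
print(mean_lod_substitution(x, 0.5, 2.0),
      mean_lod_substitution(x, 0.5, np.sqrt(2.0)),
      mean_censored_lognormal_mle(x, 0.5))
```

As the abstract notes, the MLE-style approach can become erratic when the data depart from lognormality or contain high outliers, which is why the simple substitution rule was preferred in that study.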

Journal ArticleDOI
TL;DR: Six techniques for finding multivariate outliers are illustrated on a typical laboratory safety data set; the results show that some methods do better than others depending on whether or not the data set is multivariate normal.
Abstract: Summary. During a clinical trial of a new treatment, a large number of variables are measured to monitor the safety of the treatment. It is important to detect outlying observations which may indicate that something abnormal is happening. To do this effectively, techniques are needed for finding multivariate outliers. Six techniques of this sort are described and illustrated on a typical laboratory safety data set. Their properties are investigated more thoroughly by means of a simulation study. The results show that some methods do better than others depending on whether or not the data set is multivariate normal, the dimension of the data set, the type of outlier, the proportion of outliers in a data set and the degree of contamination, i.e. 'outlyingness'. The results indicate that it is desirable to run a battery of multivariate methods on a particular data set in an attempt to highlight possible outliers.
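One classical technique of this sort is a Mahalanobis-distance screen against a chi-square cutoff, which is reasonable when the data are roughly multivariate normal; a minimal sketch (robust variants, e.g. based on the MCD covariance, are usually preferred when several outliers may mask one another).

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.001):
    """Flag observations whose squared Mahalanobis distance exceeds the
    chi-square(1 - alpha) cutoff for the data's dimension."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", X - mu, S_inv, X - mu)   # squared Mahalanobis distances
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])
    return np.flatnonzero(d2 > cutoff), d2
```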

Journal ArticleDOI
TL;DR: The present review is the first in an ongoing guide to medical statistics, using specific examples from intensive care, to describe and summarize the data.
Abstract: The present review is the first in an ongoing guide to medical statistics, using specific examples from intensive care. The first step in any analysis is to describe and summarize the data. As well as becoming familiar with the data, this is also an opportunity to look for unusually high or low values (outliers), to check the assumptions required for statistical tests, and to decide the best way to categorize the data if this is necessary. In addition to tables and graphs, summary values are a convenient way to summarize large amounts of information. This review introduces some of these measures. It describes and gives examples of qualitative data (unordered and ordered) and quantitative data (discrete and continuous); how these types of data can be represented figuratively; the two important features of a quantitative dataset (location and variability); the measures of location (mean, median and mode); the measures of variability (range, interquartile range, standard deviation and variance); common distributions of clinical data; and simple transformations of positively skewed data.

Journal ArticleDOI
TL;DR: This work looks at the quantitative effect of outliers on estimators and test statistics based on normal theory maximum likelihood and the asymptotically distribution-free procedures.
Abstract: A small proportion of outliers can distort the results based on classical procedures in covariance structure analysis. We look at the quantitative effect of outliers on estimators and test statistics based on normal theory maximum likelihood and the asymptotically distribution-free procedures. Even if a proposed structure is correct for the majority of the data in a sample, a small proportion of outliers leads to biased estimators and significant test statistics. An especially unfortunate consequence is that the power to reject a model can be made arbitrarily--but misleadingly--large by inclusion of outliers in an analysis.

Journal ArticleDOI
TL;DR: The present investigation focuses on the nonfuzzy-input, fuzzy-output data type, proposes approaches to handle the outlier problem, and introduces a pre-assigned k-limiting value that must be determined based on the conditions of the current problem.

Patent
John E. Seem1
20 Jul 2001
TL;DR: In this article, an outlier identification method is employed to detect abnormally high or low energy use in a building, where the utility use is measured periodically throughout each day and the measurements are grouped according to days that have similar average utility consumption levels.
Abstract: Outlier identification is employed to detect abnormally high or low energy use in a building. The utility use is measured periodically throughout each day and the measurements are grouped according to days that have similar average utility consumption levels. The data in each group is statistically analyzed using the Generalized Extreme Studentized Deviate (GESD) method. That method identifies outliers which are data samples that vary significantly from the majority of the data. The degree to which each outlier deviates from the remainder of the data indicates the severity of the abnormal utility consumption denoted by that outlier. The resultant outlier information is readily discernable by the building operators in assessing whether the cause of a particular occurrence of abnormal utility usage requires further investigation.
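A self-contained sketch of the GESD test as commonly described (Rosner's formulation); the patent's grouping by day type and its severity reporting are not reproduced.

```python
import numpy as np
from scipy import stats

def gesd_outliers(x, max_outliers=10, alpha=0.05):
    """Generalized Extreme Studentized Deviate test: returns indices of flagged outliers."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    removed, R, lam = [], [], []
    work, work_idx = x.copy(), np.arange(n)
    for i in range(1, max_outliers + 1):
        z = np.abs(work - work.mean()) / work.std(ddof=1)
        j = int(np.argmax(z))
        R.append(z[j])                          # i-th extreme studentized deviate
        removed.append(int(work_idx[j]))
        work = np.delete(work, j)
        work_idx = np.delete(work_idx, j)
        # critical value lambda_i from the t-distribution
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, df=n - i - 1)
        lam.append((n - i) * t / np.sqrt((n - i - 1 + t ** 2) * (n - i + 1)))
    # number of outliers = largest i with R_i > lambda_i
    n_out = 0
    for i in range(max_outliers):
        if R[i] > lam[i]:
            n_out = i + 1
    return removed[:n_out]

# daily energy readings with a few abnormal days mixed in
x = np.concatenate([np.random.default_rng(0).normal(20, 2, 96), [45, 3, 41, 44]])
print(gesd_outliers(x))
```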

BookDOI
31 Jan 2001
TL;DR: A book-length treatment of robust statistics covering the Huber minimax and Hampel approaches, probability-free optimization criteria in data analysis, equivariant contrast functions, and robust minimax estimation of location and scale.
Abstract: Contents: general remarks; the Huber minimax approach; the Hampel approach; optimization criteria in data analysis (a probability-free approach): translation and scale equivariant contrast functions, orthogonal equivariant contrast functions, monotonically equivariant contrast functions, minimal sensitivity to small perturbations in the data, affine equivariant contrast functions; robust minimax estimation of location: estimation in models with bounded variances, estimation in models with bounded subranges, robust estimators of multivariate location, least informative lattice distributions; robust estimation of scale: measures of scale defined by functionals, M-, L-, and R-estimators of scale, the Huber minimax estimator of scale; robust regression and autoregression: minimax variance regression, robust autoregression, robust identification in dynamic models; robustness of L1-norm estimators: stability of L1-approximations, robustness of the L1-regression; robust estimation of correlation: Monte Carlo and asymptotic analysis, synthesis, minimax variance correlation, two-stage estimators (rejection of outliers plus classics); computation and data analysis technologies: adaptive robust procedures, smoothing quantile functions by the Bernstein polynomials, robust bivariate boxplots; applications: robust elimination in the statistical theory of reliability, robust detection of signals based on optimisation criteria, statistical analysis of sudden cardiac death risk factors.

Journal ArticleDOI
TL;DR: A fast-food restaurant franchise is used as a case to illustrate how data mining can be applied to time series data and help the franchise reap the benefits of such an effort.

Journal ArticleDOI
TL;DR: Several published techniques for detecting multiple outliers in linear regression are evaluated in an extensive Monte Carlo simulation, assessing the impact of outlier density and geometry, regressor variable dimension, and outlying distance on detection capability and false alarm (swamping) probability.

Book ChapterDOI
01 Jan 2001
TL;DR: This work takes the Bayesian pooling approach to drawing information from analogous time series to model and forecast a given time series, and combines estimated parameters of the group model with conventional time-series-model parameters, using so-called weights shrinkage.
Abstract: Organizations that use time-series forecasting regularly, generally use it for many products or services. Among the variables they forecast are groups of analogous time series (series that follow similar, time-based patterns). Their covariation is a largely untapped source of information that can improve forecast accuracy. We take the Bayesian pooling approach to drawing information from analogous time series to model and forecast a given time series. In using Bayesian pooling, we use data from analogous time series as multiple observations per time period in a group-level model. We then combine estimated parameters of the group model with conventional time-series-model parameters, using so-called weights shrinkage. Major benefits of this approach are that it (1) requires few parameters for estimation; (2) builds directly on conventional time-series models; (3) adapts to pattern changes in time series, providing rapid adjustments and accurate model estimates; and (4) screens out adverse effects of outlier data points on time-series model estimates. For practitioners, we provide the terms, concepts, and methods necessary for a basic understanding of Bayesian pooling and the conditions under which it improves upon conventional time-series methods. For researchers, we describe the experimental data, treatments, and factors needed to compare the forecast accuracy of pooling methods. Last, we present basic principles for applying pooling methods and supporting empirical results. Conditions favoring pooling include time series with high volatility and outliers. Simple pooling methods are more accurate than complex methods, and we recommend manual intervention for cases with few time series.
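The combination step itself is a weighted average of group-level and individual parameter estimates; a one-line sketch (the pooling weights in the chapter come from the Bayesian group model, not fixed by hand as here).

```python
import numpy as np

def shrink(theta_individual, theta_group, w):
    """Weight-shrinkage combination of an individual series' parameter estimate toward
    the group (pooled) estimate; w in [0, 1] is the shrinkage weight."""
    return w * np.asarray(theta_group) + (1 - w) * np.asarray(theta_individual)

# toy example: level/trend estimates from one noisy series vs. the analogous group
print(shrink(theta_individual=[2.3, 0.9], theta_group=[1.5, 0.4], w=0.6))
```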

Proceedings ArticleDOI
11 Nov 2001
TL;DR: An outlier finder for text is implemented, which can detect both unusual matches and unusual mismatches to a text pattern, which substantially reduced errors when integrated into the user interface of a PBD text editor and tested in a user study.
Abstract: When users handle large amounts of data, errors are hard to notice. Outlier finding is a new way to reduce errors by directing the user's attention to inconsistent data which may indicate errors. We have implemented an outlier finder for text, which can detect both unusual matches and unusual mismatches to a text pattern. When integrated into the user interface of a PBD text editor and tested in a user study, outlier finding substantially reduced errors.

Journal ArticleDOI
TL;DR: This paper proposes a parametric test for symmetry based on the Pearson type IV family of distributions, which takes account of leptokurtosis explicitly, and shows that the test performs quite well in finite samples and is robust to excess kurtosis.
Abstract: Most of the tests for asymmetry are developed under the null hypothesis of a normal distribution. As is well known, much financial data exhibits fat tails, and commonly used tests (such as the standard square root test based on sample skewness) are not valid for leptokurtic financial data. Also, the square root test uses the third moment, which may not be robust in the presence of gross outliers. In this paper, we propose a simple parametric test for symmetry based on the Pearson type IV family of distributions, which takes account of leptokurtosis explicitly. Our test is based on a function that is bounded over the real line, and we expect it to be better behaved than the test based on sample skewness (third moment). Results from our Monte Carlo study reveal that the suggested test performs quite well in finite samples, and it is robust to excess kurtosis. We also apply the test to stock return data to illustrate its usefulness.