
Showing papers on "Mahalanobis distance published in 2004"


Proceedings Article
01 Dec 2004
TL;DR: A novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm that directly maximizes a stochastic variant of the leave-one-out KNN score on the training set.
Abstract: In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction.
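The stochastic leave-one-out score that the method maximizes can be written down in a few lines. The sketch below is our own illustrative re-implementation, not the authors' code (variable names are ours, and a real implementation would optimize A by gradient ascent rather than evaluate it for a fixed A): each point picks a neighbour with probability proportional to exp(-d^2) in the embedded space, and the score is the expected number of points whose picked neighbour shares their label.

```python
import math

def nca_score(X, y, A):
    """Stochastic leave-one-out KNN score for a fixed linear map A.
    X: list of feature vectors, y: class labels, A: rows of the linear map."""
    def embed(v):
        return [sum(a * vj for a, vj in zip(row, v)) for row in A]
    Z = [embed(v) for v in X]
    score = 0.0
    for i, zi in enumerate(Z):
        # squared Euclidean distances in the embedded (i.e. Mahalanobis) space
        d2 = [sum((p - q) ** 2 for p, q in zip(zi, zj)) for zj in Z]
        w = [0.0 if j == i else math.exp(-d2[j]) for j in range(len(Z))]
        total = sum(w)
        # expected probability that point i's stochastic neighbour shares its label
        score += sum(w[j] for j in range(len(Z)) if j != i and y[j] == y[i]) / total
    return score
```

Because the score is a smooth function of A, maximizing it simultaneously learns the Mahalanobis metric A^T A and, when A has few rows, a low-dimensional embedding.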

1,848 citations


Book ChapterDOI
15 Sep 2004
TL;DR: A payload-based anomaly detector, called PAYL, for intrusion detection that demonstrates the surprising effectiveness of the method on the 1999 DARPA IDS dataset and a live dataset the authors collected on the Columbia CS department network.
Abstract: We present a payload-based anomaly detector, called PAYL, for intrusion detection. PAYL models the normal application payload of network traffic in a fully automatic, unsupervised and very efficient fashion. During a training phase we first compute a profile of the byte frequency distribution and its standard deviation for the application payload flowing to a single host and port. During the detection phase we then use the Mahalanobis distance to calculate the similarity of new data to the pre-computed profile. The detector compares this measure against a threshold and generates an alert when the distance of the new input exceeds this threshold. We demonstrate the surprising effectiveness of the method on the 1999 DARPA IDS dataset and a live dataset we collected on the Columbia CS department network. In one case nearly 100% accuracy is achieved with a 0.1% false positive rate for port 80 traffic.
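The profiling and detection steps can be sketched as follows. This is a simplified toy version under our own naming: the paper additionally conditions its profiles on payload length as well as host/port, which we omit, and the distance below is a diagonal "simplified Mahalanobis" variant that treats byte frequencies as independent.

```python
def train_profile(payloads):
    """Per-byte-value frequency mean and standard deviation over training payloads."""
    freqs = []
    for p in payloads:
        counts = [0] * 256
        for b in p:          # iterating bytes yields ints in Python 3
            counts[b] += 1
        n = max(len(p), 1)
        freqs.append([c / n for c in counts])
    m = len(freqs)
    mean = [sum(f[i] for f in freqs) / m for i in range(256)]
    var = [sum((f[i] - mean[i]) ** 2 for f in freqs) / m for i in range(256)]
    return mean, [v ** 0.5 for v in var]

def simplified_mahalanobis(payload, mean, std, alpha=0.001):
    """Distance of a new payload from the profile; alpha is a smoothing
    factor that keeps never-seen byte values from dividing by zero."""
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    n = max(len(payload), 1)
    return sum(abs(counts[i] / n - mean[i]) / (std[i] + alpha) for i in range(256))
```

An alert would be raised when the returned distance exceeds a calibrated threshold, as described in the abstract.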

943 citations


Journal ArticleDOI
TL;DR: This article compares 14 distance measures and their modifications between feature vectors with respect to the recognition performance of the principal component analysis (PCA)-based face recognition method, and proposes a modified sum square error (SSE)-based distance.

312 citations


01 Jan 2004
TL;DR: The problem of choosing between the two methods of linear discriminant analysis and logistic regression is considered, and some guidelines for proper choice are set.
Abstract: Two of the most widely used statistical methods for analyzing categorical outcome variables are linear discriminant analysis and logistic regression. While both are appropriate for the development of linear classification models, linear discriminant analysis makes more assumptions about the underlying data. Hence, it is assumed that logistic regression is the more flexible and more robust method in case of violations of these assumptions. In this paper we consider the problem of choosing between the two methods, and set some guidelines for proper choice. The comparison between the methods is based on several measures of predictive accuracy. The performance of the methods is studied by simulations. We start with an example where all the assumptions of the linear discriminant analysis are satisfied and observe the impact of changes regarding the sample size, covariance matrix, Mahalanobis distance and direction of distance between group means. Next, we compare the robustness of the methods towards categorisation and non-normality of explanatory variables in a closely controlled way. We show that the results of LDA and LR are close whenever the normality assumptions are not too badly violated, and set some guidelines for recognizing these situations. We discuss the inappropriateness of LDA in all other cases.
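Under equal priors and a shared covariance, the LDA rule reduces to assigning an observation to the class whose mean is nearest in Mahalanobis distance. A minimal sketch (our own toy code, restricted to a shared diagonal covariance for brevity):

```python
def mahalanobis2_diag(x, mean, var):
    """Squared Mahalanobis distance for a diagonal covariance with variances `var`."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))

def lda_classify(x, means, var):
    """Assign x to the class whose mean is nearest in Mahalanobis distance
    (equal priors, shared diagonal covariance)."""
    d2 = [mahalanobis2_diag(x, m, var) for m in means]
    return d2.index(min(d2))
```

Logistic regression would instead fit the linear decision boundary directly, without the normality and shared-covariance assumptions this rule relies on, which is exactly the trade-off the paper studies.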

279 citations


Journal ArticleDOI
01 May 2004
TL;DR: A comprehensive Lp-type theory of distance metrics for multitarget (and, more generally, multiobject) systems is introduced and it is shown that this theory extends, and provides a rigorous theoretical basis for, an intuitively appealing optimal-assignment approach proposed by Drummond for evaluating the performance of multitarget tracking algorithms.
Abstract: The concept of miss distance (Euclidean, Mahalanobis, etc.) is a fundamental, far-reaching, and taken-for-granted element of the engineering theory and practice of single-target systems. In this paper we introduce a comprehensive Lp-type theory of distance metrics for multitarget (and, more generally, multiobject) systems. We show that this theory extends, and provides a rigorous theoretical basis for, an intuitively appealing optimal-assignment approach proposed by Drummond for evaluating the performance of multitarget tracking algorithms. We describe tractable computational approaches for computing such metrics based on standard optimal assignment or convex optimization techniques. We describe the potentially far-reaching implications of these metrics for applications such as performance evaluation and sensor management. In the former case, we demonstrate the application of multitarget miss-distance metrics as measures of effectiveness (MoEs) for multitarget tracking algorithms.

219 citations


Journal ArticleDOI
TL;DR: Mahalanobis-type distances in which the shape matrix is derived from a consistent high-breakdown robust multivariate location and scale estimator can be used to find outlying points by a robust clustering method in conjunction with an outlier identification method.

193 citations


Journal ArticleDOI
TL;DR: A study of the Euclidean distance between syntactically linked words in sentences; the hypothesis that the average distance is constrained predicts, under ideal conditions, an exponential distribution of the distance between linked words, a trend that can be identified in real sentences.
Abstract: We study the Euclidean distance between syntactically linked words in sentences. The average distance is significantly small and is a very slowly growing function of sentence length. We consider two nonexcluding hypotheses: (a) the average distance is minimized and (b) the average distance is constrained. Support for (a) comes from the significantly small average distance real sentences achieve. The strength of the minimization hypothesis decreases with the length of the sentence. Support for (b) comes from the very slow growth of the average distance versus sentence length. Furthermore, (b) predicts, under ideal conditions, an exponential distribution of the distance between linked words, a trend that can be identified in real sentences.

131 citations


Journal ArticleDOI
TL;DR: It is concluded that the PSTH-based method is an efficient alternative to more sophisticated methods such as LDA and ANNs for studying how ensembles of neurons code for discrete sensory stimuli, especially when datasets with many variables are used and when the time resolution of the neural code is one of the factors of interest.

122 citations


Journal ArticleDOI
TL;DR: Results indicate that correlation and covariance structure in the authors' species is stable, and that among-group correlation/covariance similarity is not related to genetic or phenotypic distance, while genetic and morphological distance matrices were highly correlated.
Abstract: Proportionality of phenotypic and genetic distance is of crucial importance to adequately focus on population history and structure, and it depends on the proportionality of genetic and phenotypic covariance. Constancy of phenotypic covariances is unlikely without constancy of genetic covariation if the latter is a substantial component of the former. If phenotypic patterns are found to be relatively stable, the most probable explanation is that genetic covariance matrices are also stable. Factors like morphological integration account for such stability. Morphological integration can be studied by analyzing the relationships among morphological traits. We present here a comparison of phenotypic correlation and covariance structure among worldwide human populations. Correlation and covariance matrices between 47 cranial traits were obtained for 28 populations, and compared with design matrices representing functional and developmental constraints. Among-population differences in patterns of correlation and covariation were tested for association with matrices of genetic distances (obtained after an examination of 10 Alu-insertions) and with Mahalanobis distances (computed after craniometrical traits). All matrix correlations were estimated by means of Mantel tests. Results indicate that correlation and covariance structure in our species is stable, and that among-group correlation/covariance similarity is not related to genetic or phenotypic distance. Conversely, genetic and morphological distance matrices were highly correlated. Correlation and covariation patterns were largely associated with functional and developmental factors, which probably account for the stability of covariance patterns.

121 citations


Proceedings ArticleDOI
27 Jun 2004
TL;DR: The main contribution is a distance learning method, which combines boosting hypotheses over the product space with a weak learner based on partitioning the original feature space, and which outperforms existing metric learning methods based on learning a Mahalanobis distance.
Abstract: Image retrieval critically relies on the distance function used to compare a query image to images in the database. We suggest learning such distance functions by training binary classifiers with margins, where the classifiers are defined over the product space of pairs of images. The classifiers are trained to distinguish between pairs in which the images are from the same class and pairs that contain images from different classes. The signed margin is used as a distance function. We explore several variants of this idea, based on using SVM and boosting algorithms as product space classifiers. Our main contribution is a distance learning method, which combines boosting hypotheses over the product space with a weak learner based on partitioning the original feature space. The weak learner used is a Gaussian mixture model computed using a constrained EM algorithm, where the constraints are equivalence constraints on pairs of data points. This approach allows us to incorporate unlabeled data into the training process. Using some benchmark databases from the UCI repository, we show that our margin-based methods significantly outperform existing metric learning methods, which are based on learning a Mahalanobis distance. We then show comparative results of image retrieval in a distributed learning paradigm, using two databases: a large database of facial images (YaleB), and a database of natural images taken from a commercial CD. In both cases our GMM-based boosting method outperforms all other methods, and its generalization to unseen classes is superior.

119 citations


01 Jan 2004
TL;DR: A method for the detection of multivariate outliers is proposed which accounts for the data structure and sample size and defines the cut-off value by a measure of deviation of the empirical distribution function of the robust Mahalanobis distance from the theoretical distribution function.
Abstract: A method for the detection of multivariate outliers is proposed which accounts for the data structure and sample size. The cut-off value for identifying outliers is defined by a measure of deviation of the empirical distribution function of the robust Mahalanobis distance from the theoretical distribution function. The method is easy to implement and fast to compute.
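A minimal illustration of Mahalanobis-based outlier flagging with a fixed chi-square cutoff, i.e. the baseline that the proposed method adapts. This is our own sketch, not the paper's procedure: we use the classical mean and a diagonal covariance where the paper uses a robust high-breakdown estimator, and a fixed quantile where the paper derives an adaptive cutoff from the deviation between the empirical and theoretical distance distributions. For 2-D data the chi-square(2) quantile has the closed form -2*ln(1-p).

```python
import math

def mahalanobis_outliers(points, quantile=0.975):
    """Flag 2-D points whose squared Mahalanobis distance exceeds the
    chi-square(2) quantile. Classical mean and diagonal covariance are
    stand-ins for the robust estimator the paper assumes."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(2)]
    var = [sum((p[k] - mean[k]) ** 2 for p in points) / n for k in range(2)]
    cutoff = -2.0 * math.log(1.0 - quantile)   # chi-square(2) inverse CDF
    d2 = [sum((p[k] - mean[k]) ** 2 / var[k] for k in range(2)) for p in points]
    return [i for i, d in enumerate(d2) if d > cutoff]
```

The paper's refinement replaces the fixed `quantile` with a data-driven cutoff, which avoids flagging a constant fraction of clean data as outliers.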

Journal ArticleDOI
TL;DR: In this paper, five classification methods were examined to determine the most suitable classification algorithm for the identification of no-till (NT) and traditional tillage (TT) cropping methods: minimum distance (MD), Mahalanobis distance, maximum likelihood (ML), spectral angle mapping (SAM), and the cosine of the angle concept (CAC).

Journal ArticleDOI
TL;DR: A colorimetric characterisation of the more common defects has been carried out; a neural network with a hidden layer is able to classify the olives with an accuracy of over 90%, while partial least squares discriminant analysis and the Mahalanobis distance achieve over 70%.

Journal ArticleDOI
TL;DR: This study provides a comparison of a standard method, based on the Mahalanobis distance and used in multivariate approaches, to a robust method based on the minimum volume ellipsoid as a means of determining whether data sets contain outliers or not, and suggests that ecologists consider that their data may contain atypical points.
Abstract: Ecological studies frequently involve large numbers of variables and observations, and these are often subject to various errors. If some data are not representative of the study population, they tend to bias the interpretation and conclusion of an ecological study. Because of the multivariate nature of ecological data, it is very difficult to identify atypical observations using approaches such as univariate or bivariate plots. This difficulty calls for the application of robust statistical methods in identifying atypical observations. Our study provides a comparison of a standard method, based on the Mahalanobis distance, used in multivariate approaches to a robust method based on the minimum volume ellipsoid as a means of determining whether data sets contain outliers or not. We evaluate both methods using simulations varying conditions of the data, and show that the minimum volume ellipsoid approach is superior in detecting outliers where present. We show that, as the sample size parameter, h, used in the robust approach increases in value, there is a decrease in the accuracy and precision of the associated estimate of the number of outliers present, in particular as the number of outliers increases. Conversely, where no outliers are present, large values for the parameter provide the most accurate results. In addition to the simulation results, we demonstrate the use of the robust principal component analysis with a data set of lake-water chemistry variables to illustrate the additional insight available. We suggest that ecologists consider that their data may contain atypical points. Following checks associated with normality, bivariate linearity and other traditional aspects, we advocate that ecologists examine their data sets using robust multivariate methods. 
Points identified as being atypical should be carefully evaluated based on background information to determine their suitability for inclusion in further multivariate analyses and whether additional factors explain their unusual characteristics. Copyright © 2004 John Wiley & Sons, Ltd.

Book ChapterDOI
01 Dec 2004
TL;DR: It is shown that the Euclidean distance squared transform requires fewer computations than the commonly used 5x5 chamfer transform.
Abstract: Within image analysis the distance transform has many applications. The distance transform measures the distance of each object point from the nearest boundary. For ease of computation, a commonly used approximate algorithm is the chamfer distance transform. This paper presents an efficient linear-time algorithm for calculating the true squared Euclidean distance of each point from the nearest boundary. It works by performing a 1D distance transform on each row of the image, and then combining the results in each column. It is shown that the squared Euclidean distance transform requires fewer computations than the commonly used 5x5 chamfer transform.
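The row-then-column decomposition can be illustrated directly. The sketch below is our own and uses brute-force inner loops for clarity; the paper's contribution is performing the same row pass and column combination in linear time, with an identical result.

```python
INF = float("inf")

def edt_squared(img):
    """Exact squared Euclidean distance transform of a binary image
    (1 = object, 0 = background)."""
    h, w = len(img), len(img[0])
    # Pass 1: per-row distance to the nearest background pixel in that row.
    row = [[INF] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x] == 0:
                row[y][x] = 0
            else:
                for x2 in range(w):
                    if img[y][x2] == 0:
                        row[y][x] = min(row[y][x], abs(x - x2))
    # Pass 2: combine down each column using squared distances.
    d2 = [[INF] * w for _ in range(h)]
    for x in range(w):
        for y in range(h):
            for y2 in range(h):
                if row[y2][x] != INF:
                    d2[y][x] = min(d2[y][x], row[y2][x] ** 2 + (y - y2) ** 2)
    return d2
```

The key observation the code mirrors is that the nearest background pixel to (x, y) lies in some row y2, at the pixel that is nearest within that row, so the exact 2D answer is the minimum over y2 of row-distance squared plus (y - y2) squared.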

Book
25 Feb 2004
TL;DR: A textbook on multivariate statistical methods for quality engineering, covering graphical data display, multivariate analysis of variance, principal component and factor analysis, discriminant and cluster analysis, the Mahalanobis distance and Taguchi method, path analysis, and multivariate statistical process control.
Abstract: Chapter 1: Multivariate Statistical Methods and Quality Chapter 2: Graphical Multivariate Data Display and Data Stratification Chapter 3: Introduction to Multivariate Random Variables, Normal Distribution, and Sampling Properties Chapter 4: Multivariate Analysis of Variance Chapter 5: Principal Component Analysis and Factor Analysis Chapter 6: Discriminant Analysis Chapter 7: Cluster Analysis Chapter 8: Mahalanobis Distance and Taguchi Method Chapter 9: Path Analysis and the Structural Method Chapter 10: Multivariate Statistical Process Control APPENDIX: PROBABILITY DISTRIBUTION TABLES REFERENCES INDEX

Proceedings ArticleDOI
01 Jan 2004
TL;DR: Different classification algorithms including maximum likelihood classifier (MLC), Gaussian mixture model (GMM), neural network (NN), K-nearest neighbors (K-NN), and Fisher's linear discriminant analysis (FLDA) are compared and recognition results show that FLDA gives the best recognition accuracy by using the selected features.
Abstract: This paper presents our recent work on recognizing human emotion from the speech signal. The proposed recognition system was tested over a language, speaker, and context independent emotional speech database. Prosodic, Mel-frequency cepstral coefficient (MFCC), and formant frequency features are extracted from the speech utterances. We perform feature selection by using the stepwise method based on Mahalanobis distance. The selected features are used to classify the speeches into their corresponding emotional classes. Different classification algorithms including maximum likelihood classifier (MLC), Gaussian mixture model (GMM), neural network (NN), K-nearest neighbors (K-NN), and Fisher's linear discriminant analysis (FLDA) are compared in this study. The recognition results show that FLDA gives the best recognition accuracy by using the selected features.
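The stepwise selection criterion can be sketched as a greedy forward search that adds, at each step, the feature that most increases the between-class Mahalanobis distance. Below is a toy two-class version with a pooled diagonal covariance; it is our own illustration of the criterion, not the authors' code, which handles multiple emotion classes and uses the full stepwise add/remove procedure.

```python
def forward_select(X0, X1, k):
    """Greedily pick k feature indices maximising the diagonal-covariance
    Mahalanobis distance between the means of classes X0 and X1."""
    dims = len(X0[0])

    def d2(features):
        total = 0.0
        for j in features:
            m0 = sum(x[j] for x in X0) / len(X0)
            m1 = sum(x[j] for x in X1) / len(X1)
            # pooled variance of feature j across both classes
            v = (sum((x[j] - m0) ** 2 for x in X0)
                 + sum((x[j] - m1) ** 2 for x in X1)) / (len(X0) + len(X1))
            total += (m0 - m1) ** 2 / v
        return total

    chosen = []
    while len(chosen) < k:
        best = max((j for j in range(dims) if j not in chosen),
                   key=lambda j: d2(chosen + [j]))
        chosen.append(best)
    return chosen
```

The selected indices would then feed whichever classifier is being compared (MLC, GMM, NN, K-NN, or FLDA in the paper's study).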

Journal ArticleDOI
TL;DR: A refinement of Ward's clustering method is described: the original method was iterative and tedious, since the spatial covariance structure had to be re-estimated at each step, but the methodology is improved by using the fast Fourier transform to find the covariance structure.

Journal ArticleDOI
TL;DR: It is shown that the quality of an approximate distance field may be characterized locally near the boundary by its order of normalization and can be studied in terms of the field derivatives.
Abstract: For a given set of points S, a Euclidean distance field is defined by associating with every point p of Euclidean space E^d a value that is equal to the Euclidean distance from p to S. Such distance fields have numerous computational applications, but are expensive to compute and may not be sufficiently smooth for some applications. Instead, popular implicit modeling techniques rely on various approximate fields constructed in a piecewise manner. All such constructions lead to sacrifices in distance properties that have not been properly studied or characterized. We show that the quality of an approximate distance field may be characterized locally near the boundary by its order of normalization and can be studied in terms of the field derivatives. The approach allows systematic quantitative assessment and comparison of various construction methods. In particular, we provide detailed analysis of several popular field construction methods that rely on set decompositions and R-functions, as well as identify the key factors affecting the quality of the constructed fields.

Journal ArticleDOI
TL;DR: An extension of the neural gas vector quantization method to local principal component analysis; for competition between local units, it combines a normalized Mahalanobis distance in the principal subspace with the squared reconstruction error.

Proceedings ArticleDOI
12 Aug 2004
TL;DR: In this article, the authors used Elliptically Contoured Distributions (ECDs) to model the statistical variability of hyperspectral imaging (HSI) data and used the Exceedance metric to improve the accuracy of the model to match the long probabilistic tails of the data.
Abstract: Developing proper models for hyperspectral imaging (HSI) data allows for useful and reliable algorithms for data exploitation. These models provide the foundation for development and evaluation of detection, classification, clustering, and estimation algorithms. To date, most algorithms have modeled real data as multivariate normal; however, it is well known that real data often exhibit non-normal behavior. In this paper, Elliptically Contoured Distributions (ECDs) are used to model the statistical variability of HSI data. Non-homogeneous data sets can be modeled as a finite mixture of more than one ECD, with different means and parameters for each component. A larger family of distributions, the ECDs include the multivariate normal distribution and exhibit most of its properties. ECDs are uniquely defined by their multivariate mean, covariance, and the distribution of their Mahalanobis distance metric. This metric lets multivariate data be identified using a univariate statistic and can be adjusted to more closely match the longer-tailed distributions of real data. One ECD member of focus is the multivariate t-distribution, which provides longer-tailed distributions than the normal and has an F-distributed Mahalanobis distance statistic. This work focuses on modeling these univariate statistics using the Exceedance metric, a quantitative goodness-of-fit metric developed specifically to improve the accuracy of the model in matching the long probabilistic tails of the data. The metric is shown to be effective in modeling the univariate Mahalanobis distance distributions of hyperspectral data from the HYDICE sensor as either an F-distribution or as a weighted mixture of F-distributions, implying that hyperspectral data have a multivariate t-distribution. Proper modeling of hyperspectral data leads to the ability to generate synthetic data with the same statistical distribution as real-world data.

Journal ArticleDOI
27 Jan 2004-Analyst
TL;DR: The ability for SVM to generalize well makes this technique attractive when dealing with limited sized training sets, and results show that discrimination is achievable between the two methods, with SVM performing better than discriminant analysis on the dataset investigated.
Abstract: This paper describes the application of support vector machines (SVM) to analytical chemical data, and is exemplified by its application to the determination of tablet production using pyrolysis-gas chromatography-mass spectrometry. An approach relying on SVM in conjunction with other chemometrics tools such as principal component analysis and discriminant analysis is presented. The ability of SVM to generalize well makes this technique attractive when dealing with limited-sized training sets. By using appropriate kernels, SVMs result in classifiers of diverse complexity able to draw non-linear decision class boundaries that may suit composite distributions. Principal component analysis and discriminant analysis by means of the Mahalanobis distance are used in a stepwise procedure for extracting and selecting meaningful features from the pyrolysis spectrum, in order to feed various SVM classifiers. Results show that discrimination is achievable between the two methods, with SVM performing better than discriminant analysis on the dataset investigated.

Proceedings ArticleDOI
15 Apr 2004
TL;DR: This reconstruction provides an appropriate intra-operative 3D visualization without the need for pre- or intra-operative imaging, and allows the incorporation of non-spatial data such as patient height and weight.
Abstract: We propose a novel method for reconstructing a complete 3D model of a given anatomy from minimal information. This reconstruction provides an appropriate intra-operative 3D visualization without the need for pre- or intra-operative imaging. Our method fits a statistical deformable model to sparse 3D data consisting of digitized landmarks and bone surface points. The method also allows the incorporation of non-spatial data such as patient height and weight. The statistical model is constructed using principal component analysis (PCA) from a set of training objects. Our morphing method then computes a Mahalanobis-distance-weighted least squares fit of the model by solving a linear equation system. First promising experimental results with a model generated from 14 femoral heads are presented.

Journal ArticleDOI
TL;DR: A Monte Carlo study has been performed to illustrate the problems when using stepwise feature selection and discriminant analysis and shows that in order to find the correct features, the necessary ratio of number of training samples to feature candidates is not a constant.

Proceedings ArticleDOI
09 Aug 2004
TL;DR: A comparative study of the performances of these projection approaches for a simple tracking case is presented and the study is extended to the case of road intersections in which a sequential ratio test is presented in order to select the best road segment.
Abstract: The tracking of a Ground Moving Target (GMTI) is a challenging problem given the environment complexity, the target maneuvers and the false alarm rate. Using the road network information in the tracking process is considered an asset, mainly when the target movement is limited to the road. In this paper, we consider different approaches to incorporate the road information into the tracking process. Based on the assumption that the target is following the road network, and using a classical estimation technique, the idea is to keep the state estimate on the road by using different "projection" approaches. The first approach is a deterministic one, based either on the minimization of the distance between the estimate and its projection on the road or on the minimization of the distance between the measurement and its projection on the road. In this case, the state estimate is updated using the projected measurement. The second approach is a probabilistic one. Given the probability distributions of the measurement error and the state estimate, we propose to use this information in order to maximize the a posteriori measurement probability and the a posteriori estimate probability under the road constraints. This maximization is equivalent to a minimization of the Mahalanobis distance under the same constraints. To differentiate this approach from the deterministic one, we call this projection a pseudo-projection onto the road segment. We present a comparative study of the performances of these projection approaches for a simple tracking case, and then extend the study to the case of road intersections, in which we present a sequential ratio test in order to select the best road segment. © (2004) COPYRIGHT SPIE--The International Society for Optical Engineering.
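For a straight road segment, the constrained Mahalanobis minimization has a closed form: minimizing (z - x)^T S^-1 (z - x) over x = a + t(b - a), t in [0, 1], gives t* = (u^T S^-1 r) / (u^T S^-1 u) clipped to [0, 1], with u = b - a and r = z - a. A 2-D sketch under our own notation (an illustration of the pseudo-projection idea, not the paper's full tracker):

```python
def project_on_road(z, a, b, s_inv):
    """Pseudo-projection of estimate z onto road segment [a, b]: the point of
    the segment minimising the Mahalanobis distance to z, where s_inv is the
    2x2 inverse covariance of the estimate."""
    u = [b[0] - a[0], b[1] - a[1]]   # segment direction
    r = [z[0] - a[0], z[1] - a[1]]   # offset of the estimate from the segment start

    def quad(p, q):                  # p^T S^-1 q
        return sum(p[i] * s_inv[i][j] * q[j] for i in range(2) for j in range(2))

    t = quad(u, r) / quad(u, u)
    t = min(1.0, max(0.0, t))        # stay on the segment
    return [a[0] + t * u[0], a[1] + t * u[1]]
```

With an identity covariance this reduces to the deterministic Euclidean projection; an anisotropic covariance tilts the projection toward the directions in which the estimate is least certain, which is exactly what distinguishes the probabilistic approach from the deterministic one.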

Journal ArticleDOI
TL;DR: It is shown that combining the SVDD descriptions improves the retrieval performance with respect to ranking, in contrast to the Mahalanobis case.
Abstract: A flexible description of images is offered by a cloud of points in a feature space. In the context of image retrieval such clouds can be represented in a number of ways. Two approaches are here considered. The first approach is based on the assumption of a normal distribution, hence homogeneous clouds, while the second one focuses on the boundary description, which is more suitable for multimodal clouds. The images are then compared either by using the Mahalanobis distance or by the support vector data description (SVDD), respectively. The paper investigates some possibilities of combining the image clouds based on the idea that the responses of several cloud descriptions may convey a pattern specific to semantically similar images. A ranking of image dissimilarities is used as a comparison for two image databases targeting image classification and retrieval problems. We show that combining the SVDD descriptions improves the retrieval performance with respect to ranking, in contrast to the Mahalanobis case. Surprisingly, it turns out that the ranking of the Mahalanobis distances also works well for inhomogeneous images.

Journal ArticleDOI
TL;DR: In this paper, a modification of the classical Cook's distance is proposed, providing us with a generalized Mahalanobis distance in the context of multivariate elliptical linear regression models.

Journal Article
TL;DR: Results indicate the superiority of ANN classifier over commonly used maximum likelihood and other classifiers and inclusion of one of the short-wave infrared or thermal infrared channels significantly improves the classification accuracy irrespective of the algorithm used.
Abstract: A study was taken up to evaluate the potential of the Gaussian maximum likelihood classifier, Mahalanobis minimum-distance classifier, minimum-distance classifier, and artificial neural network (ANN) classifier in deriving information on land-use/land-cover over part of Ethiopia using various band combinations of Landsat TM data. The values of the Kappa statistics were used to compare the performance of the classifiers, two at a time, by means of a Z-statistic. Results indicate the superiority of the ANN classifier over the commonly used maximum likelihood and other classifiers. Also, inclusion of one of the short-wave infrared or thermal infrared channels significantly improves the classification accuracy irrespective of the algorithm used.

Book ChapterDOI
19 Aug 2004
TL;DR: A machine learning algorithm based on AdaBoost selects a small number of critical features from a large set and yields extremely efficient classifiers, which give more accurate and reliable matching between model and new images than modeling image intensity alone.
Abstract: The paper describes a machine learning approach for improving active shape model segmentation, which can achieve high detection rates. Rather than represent the image structure using intensity gradients, we extract local edge features for each landmark using steerable filters. A machine learning algorithm based on AdaBoost selects a small number of critical features from a large set and yields extremely efficient classifiers. These non-linear classifiers are used, instead of the linear Mahalanobis distance, to find optimal displacements by searching along the direction perpendicular to each landmark. These features give more accurate and reliable matching between model and new images than modeling image intensity alone. Experimental results demonstrated the ability of this improved method to accurately locate edge features.

Journal ArticleDOI
TL;DR: This study investigates and compares the utility of three Mahalanobis distance (M-distance) measures in identifying and downweighting aberrant item response patterns and indicated that a residual-based M-distance measure had the best properties.
Abstract: Unidimensionality is the hallmark psychometric feature of a well-constructed measurement scale. However, in determining the degree to which a set of items form a unidimensional scale, aberrant item response patterns may distort our investigations. For example, aberrant response patterns may adversely impact interitem covariances which, in turn, can distort estimates of a scale's dimensionality and reliability. In this study, we investigate and compare the utility of three Mahalanobis distance (M-distance) measures in identifying and downweighting aberrant item response patterns. Our findings indicated that a residual-based M-distance measure had the best properties. Specifically, response patterns having greater residual-based M-distances were responsible for observed violations of unidimensionality. When these response patterns were properly downweighted according to this M-distance, the data fitted a one-factor model better and scale reliability increased. The procedures are illustrated using three real data sets.