
Showing papers on "Mahalanobis distance" published in 2007


Proceedings ArticleDOI
20 Jun 2007
TL;DR: An information-theoretic approach to learning a Mahalanobis distance function that can handle a wide variety of constraints and can optionally incorporate a prior on the distance function; an online version is also presented, with regret bounds derived for the resulting algorithm.
Abstract: In this paper, we present an information-theoretic approach to learning a Mahalanobis distance function. We formulate the problem as that of minimizing the differential relative entropy between two multivariate Gaussians under constraints on the distance function. We express this problem as a particular Bregman optimization problem---that of minimizing the LogDet divergence subject to linear constraints. Our resulting algorithm has several advantages over existing methods. First, our method can handle a wide variety of constraints and can optionally incorporate a prior on the distance function. Second, it is fast and scalable. Unlike most existing methods, no eigenvalue computations or semi-definite programming are required. We also present an online version and derive regret bounds for the resulting algorithm. Finally, we evaluate our method on a recent error reporting system for software called Clarify, in the context of metric learning for nearest neighbor classification, as well as on standard data sets.

2,058 citations
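
To make the objects in this formulation concrete, the sketch below computes the two quantities the abstract combines: the squared Mahalanobis distance under a learned matrix A and the LogDet divergence that keeps A close to a prior A0. It is an illustrative NumPy sketch, not the authors' ITML implementation, and the toy matrices are hypothetical.

```python
# Illustrative NumPy sketch (not the authors' ITML code): the Mahalanobis
# distance parameterized by a positive-definite matrix A, and the LogDet
# divergence used to keep A close to a prior A0 while constraints are enforced.
import numpy as np

def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis distance d_A(x, y) = (x - y)^T A (x - y)."""
    d = x - y
    return float(d @ A @ d)

def logdet_divergence(A, A0):
    """LogDet divergence D_ld(A, A0) = tr(A A0^-1) - log det(A A0^-1) - n."""
    n = A.shape[0]
    M = A @ np.linalg.inv(A0)
    _, logdet = np.linalg.slogdet(M)
    return float(np.trace(M) - logdet - n)

# Toy example: a matrix that stretches the first coordinate, compared with an
# identity prior.
A0 = np.eye(3)
A = np.diag([2.0, 1.0, 1.0])
x, y = np.array([1.0, 0.0, 0.0]), np.zeros(3)
print(mahalanobis_sq(x, y, A))    # 2.0
print(logdet_divergence(A, A0))   # 2 - log(2) - 1 ~ 0.31
```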


Book
02 Apr 2007
TL;DR: This book surveys the development of chemometrics and its application areas, including unsupervised pattern recognition and the application of spectroscopy in medicine and science.
Abstract: Preface.
1 Introduction. 1.1 Development of Chemometrics. 1.2 Application Areas. 1.3 How to Use this Book. 1.4 Literature and Other Sources of Information. References.
2 Experimental Design. 2.1 Why Design Experiments in Chemistry? 2.2 Degrees of Freedom and Sources of Error. 2.3 Analysis of Variance and Interpretation of Errors. 2.4 Matrices, Vectors and the Pseudoinverse. 2.5 Design Matrices. 2.6 Factorial Designs. 2.7 An Example of a Factorial Design. 2.8 Fractional Factorial Designs. 2.9 Plackett-Burman and Taguchi Designs. 2.10 The Application of a Plackett-Burman Design to the Screening of Factors Influencing a Chemical Reaction. 2.11 Central Composite Designs. 2.12 Mixture Designs. 2.13 A Four Component Mixture Design Used to Study Blending of Olive Oils. 2.14 Simplex Optimization. 2.15 Leverage and Confidence in Models. 2.16 Designs for Multivariate Calibration. References.
3 Statistical Concepts. 3.1 Statistics for Chemists. 3.2 Errors. 3.3 Describing Data. 3.4 The Normal Distribution. 3.5 Is a Distribution Normal? 3.6 Hypothesis Tests. 3.7 Comparison of Means: the t-Test. 3.8 F-Test for Comparison of Variances. 3.9 Confidence in Linear Regression. 3.10 More about Confidence. 3.11 Consequences of Outliers and How to Deal with Them. 3.12 Detection of Outliers. 3.13 Shewhart Charts. 3.14 More about Control Charts. References.
4 Sequential Methods. 4.1 Sequential Data. 4.2 Correlograms. 4.3 Linear Smoothing Functions and Filters. 4.4 Fourier Transforms. 4.5 Maximum Entropy and Bayesian Methods. 4.6 Fourier Filters. 4.7 Peakshapes in Chromatography and Spectroscopy. 4.8 Derivatives in Spectroscopy and Chromatography. 4.9 Wavelets. References.
5 Pattern Recognition. 5.1 Introduction. 5.2 Principal Components Analysis. 5.3 Graphical Representation of Scores and Loadings. 5.4 Comparing Multivariate Patterns. 5.5 Preprocessing. 5.6 Unsupervised Pattern Recognition: Cluster Analysis. 5.7 Supervised Pattern Recognition. 5.8 Statistical Classification Techniques. 5.9 K Nearest Neighbour Method. 5.10 How Many Components Characterize a Dataset? 5.11 Multiway Pattern Recognition. References.
6 Calibration. 6.1 Introduction. 6.2 Univariate Calibration. 6.3 Multivariate Calibration and the Spectroscopy of Mixtures. 6.4 Multiple Linear Regression. 6.5 Principal Components Regression. 6.6 Partial Least Squares. 6.7 How Good is the Calibration and What is the Most Appropriate Model? 6.8 Multiway Calibration. References.
7 Coupled Chromatography. 7.1 Introduction. 7.2 Preparing the Data. 7.3 Chemical Composition of Sequential Data. 7.4 Univariate Purity Curves. 7.5 Similarity Based Methods. 7.6 Evolving and Window Factor Analysis. 7.7 Derivative Based Methods. 7.8 Deconvolution of Evolutionary Signals. 7.9 Noniterative Methods for Resolution. 7.10 Iterative Methods for Resolution.
8 Equilibria, Reactions and Process Analytics. 8.1 The Study of Equilibria using Spectroscopy. 8.2 Spectroscopic Monitoring of Reactions. 8.3 Kinetics and Multivariate Models for the Quantitative Study of Reactions. 8.4 Developments in the Analysis of Reactions using On-line Spectroscopy. 8.5 The Process Analytical Technology Initiative. References.
9 Improving Yields and Processes Using Experimental Designs. 9.1 Introduction. 9.2 Use of Statistical Designs for Improving the Performance of Synthetic Reactions. 9.3 Screening for Factors that Influence the Performance of a Reaction. 9.4 Optimizing the Process Variables. 9.5 Handling Mixture Variables using Simplex Designs. 9.6 More about Mixture Variables.
10 Biological and Medical Applications of Chemometrics. 10.1 Introduction. 10.2 Taxonomy. 10.3 Discrimination. 10.4 Mahalanobis Distance. 10.5 Bayesian Methods and Contingency Tables. 10.6 Support Vector Machines. 10.7 Discriminant Partial Least Squares. 10.8 Micro-organisms. 10.9 Medical Diagnosis using Spectroscopy. 10.10 Metabolomics using Coupled Chromatography and Nuclear Magnetic Resonance. References.
11 Biological Macromolecules. 11.1 Introduction. 11.2 Sequence Alignment and Scoring Matches. 11.3 Sequence Similarity. 11.4 Tree Diagrams. 11.5 Phylogenetic Trees. References.
12 Multivariate Image Analysis. 12.1 Introduction. 12.2 Scaling Images. 12.3 Filtering and Smoothing the Image. 12.4 Principal Components for the Enhancement of Images. 12.5 Regression of Images. 12.6 Alternating Least Squares as Employed in Image Analysis. 12.7 Multiway Methods In Image Analysis. References.
13 Food. 13.1 Introduction. 13.2 How to Determine the Origin of a Food Product using Chromatography. 13.3 Near Infrared Spectroscopy. 13.4 Other Information. 13.5 Sensory Analysis: Linking Composition to Properties. 13.6 Varimax Rotation. 13.7 Calibrating Sensory Descriptors to Composition. References.
Index.

496 citations


01 Jan 2007
TL;DR: In this paper, the similarity between observations is evaluated using both the absolute difference in propensity scores and a Mahalanobis distance that includes the propensity score along with other covariates, and a globally optimal match with a variable number of controls, found using network flows, is presented.
Abstract: Propensity score-matching methods are often used to control for bias in observational studies when randomization is not possible. This paper describes how to match samples using both local and global optimal matching algorithms. The paper includes macros to perform the nearest available neighbor, caliper, and radius matching methods with or without replacement and matching treated observations to one or many controls. The similarity between observations is evaluated using both the absolute difference in propensity scores and the Mahalanobis distance that includes the propensity score along with other covariates. This paper also explains how to find a global optimal match with a variable number of controls using network flows. SAS® 9.1, SAS/STAT®, and SAS/OR® are required.

148 citations
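
As a rough illustration of the matching step described above, the sketch below performs greedy nearest-available-neighbor matching without replacement, using a Mahalanobis distance over covariates augmented with the propensity score. It is a minimal NumPy sketch under assumed data, not the paper's SAS macros or its network-flow optimal matching.

```python
# Minimal sketch (NumPy, not the paper's SAS macros): greedy nearest-available-
# neighbor matching without replacement, where the distance between a treated
# and a control unit is a Mahalanobis distance over the covariates augmented
# with the estimated propensity score.
import numpy as np

def greedy_mahalanobis_match(X_treated, X_control):
    """Return, for each treated row, the index of its matched control row."""
    pooled = np.vstack([X_treated, X_control])
    VI = np.linalg.inv(np.cov(pooled, rowvar=False))   # inverse pooled covariance
    available = list(range(len(X_control)))
    matches = []
    for xt in X_treated:
        diffs = X_control[available] - xt
        d2 = np.einsum('ij,jk,ik->i', diffs, VI, diffs)  # squared Mahalanobis distances
        best = available[int(np.argmin(d2))]
        matches.append(best)
        available.remove(best)                           # matching without replacement
    return matches

# Hypothetical data: columns are [propensity score, age, baseline outcome].
rng = np.random.default_rng(0)
X_t = rng.normal(size=(5, 3))
X_c = rng.normal(size=(20, 3))
print(greedy_mahalanobis_match(X_t, X_c))
```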


Journal ArticleDOI
TL;DR: This paper proposes a novel method to construct a patient-specific three-dimensional model that provides an appropriate intra-operative visualization without the need for pre- or intra-operative imaging.

137 citations


Journal ArticleDOI
TL;DR: It is found that propensity-score matching is most effective, but the detailed implementation is not of critical importance, and the difference-in-difference estimator may provide a better measure of program impact.
Abstract: We use administrative data from Missouri to examine the sensitivity of earnings impact estimates for a job training program based on alternative nonexperimental methods. We consider regression adjustment, Mahalanobis distance matching, and various methods using propensity-score matching, examining both cross-sectional estimates and difference-in-difference estimates. Specification tests suggest that the difference-in-difference estimator may provide a better measure of program impact. We find that propensity-score matching is most effective, but the detailed implementation is not of critical importance. Our analyses demonstrate that existing data can be used to obtain useful estimates of program impact.

132 citations


Journal ArticleDOI
TL;DR: In this paper, a framework for investigating predictability based on information theory is presented, which connects and unifies a wide variety of statistical methods traditionally used in predictability analysis, including linear regression, canonical correlation analysis, singular value decomposition, discriminant analysis, and data assimilation.
Abstract: This paper summarizes a framework for investigating predictability based on information theory. This framework connects and unifies a wide variety of statistical methods traditionally used in predictability analysis, including linear regression, canonical correlation analysis, singular value decomposition, discriminant analysis, and data assimilation. Central to this framework is a procedure called predictable component analysis (PrCA). PrCA optimally decomposes variables by predictability, just as principal component analysis optimally decomposes variables by variance. For normal distributions the same predictable components are obtained whether one optimizes predictive information, the dispersion part of relative entropy, mutual information, Mahalanobis error, average signal to noise ratio, normalized mean square error, or anomaly correlation. For joint normal distributions, PrCA is equivalent to canonical correlation analysis between forecast and observations. The regression operator that maps observations to forecasts plays an important role in this framework, with the left singular vectors of this operator being the predictable components and the singular values being the canonical correlations. This correspondence between predictable components and singular vectors occurs only if the singular vectors are computed using Mahalanobis norms, a result that sheds light on the role of norms in predictability. In linear stochastic models the forcing that minimizes predictability is the one that renders the “whitened” dynamical operator normal. This condition for minimum predictability is invariant to linear transformation and is equivalent to detailed balance. The framework also inspires some new approaches to accounting for deficiencies of forecast models and estimating distributions from finite samples.

131 citations


Journal ArticleDOI
TL;DR: Effective combinations of computational methods allow classification of human movement intention from single-trial EEG with reasonable accuracy and could form the basis of a brain-computer interface driven by natural human movement, which might reduce the need for long-term training.

129 citations


Journal ArticleDOI
TL;DR: In this article, a time-series-based detection algorithm utilizing Gaussian Mixture Models (GMM) is proposed for detecting damage and assessing its extent, demonstrated on the ASCE Benchmark Structure simulated data.
Abstract: In this paper, a time series based detection algorithm is proposed utilizing the Gaussian Mixture Models. The two critical aspects of damage diagnosis that are investigated are detection and extent. The vibration signals obtained from the structure are modeled as autoregressive moving average (ARMA) processes. The feature vector used consists of the first three autoregressive coefficients obtained from the modeling of the vibration signals. Damage is detected by observing a migration of the extracted AR coefficients with damage. A Gaussian Mixture Model (GMM) is used to model the feature vector. Damage is detected using the gap statistic, which ascertains the optimal number of mixtures in a particular dataset. The Mahalanobis distance between the mixture in question and the baseline (undamaged) mixture is a good indicator of damage extent. Application cases from the ASCE Benchmark Structure simulated data have been used to test the efficacy of the algorithm. This approach provides a useful framework for data fusion, where different measurements such as strains, temperature, and humidity could be used for a more robust damage decision.

121 citations
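
A simplified sketch of the feature-and-distance pipeline described above: AR coefficients are extracted from each vibration signal, the undamaged signals define a baseline distribution, and a test signal is scored by the Mahalanobis distance of its coefficients from that baseline. The GMM fitting and gap statistic of the paper are omitted, and the simulated signals are hypothetical.

```python
# Simplified sketch (not the paper's full GMM/gap-statistic method): extract
# the first three AR coefficients from each signal, model the baseline
# (undamaged) feature vectors as a single Gaussian, and use the Mahalanobis
# distance of a test feature vector from that baseline as a damage indicator.
import numpy as np

def ar_coefficients(signal, order=3):
    """Least-squares AR(order) fit: x[t] ~ a1*x[t-1] + ... + ap*x[t-p]."""
    X = np.column_stack([signal[order - k - 1: len(signal) - k - 1] for k in range(order)])
    y = signal[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def mahalanobis_to_baseline(feature, baseline_features):
    mu = baseline_features.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(baseline_features, rowvar=False))
    d = feature - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Hypothetical signals: baseline responses vs. a "damaged" response with shifted dynamics.
rng = np.random.default_rng(1)
def simulate(a1, n=2000):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = a1 * x[t - 1] + rng.normal()
    return x

baseline = np.array([ar_coefficients(simulate(0.7)) for _ in range(30)])
damaged = ar_coefficients(simulate(0.4))
print(mahalanobis_to_baseline(damaged, baseline))   # a large distance suggests damage
```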


Journal ArticleDOI
TL;DR: The concept of isometric embedding is introduced and linked to the concepts of positive and conditionally negative definiteness to demonstrate classes of valid norm dependent isotropic covariance and variogram functions, results many of which have yet to appear in the mainstream geostatistical literature or application.
Abstract: In many scientific disciplines, straight line, Euclidean distances may not accurately describe proximity relationships among spatial data. However, non-Euclidean distance measures must be used with caution in geostatistical applications. A simple example is provided to demonstrate there are no guarantees that existing covariance and variogram functions remain valid (i.e. positive definite or conditionally negative definite) when used with a non-Euclidean distance measure. There are certain distance measures that when used with existing covariance and variogram functions remain valid, an issue that is explored. The concept of isometric embedding is introduced and linked to the concepts of positive and conditionally negative definiteness to demonstrate classes of valid norm dependent isotropic covariance and variogram functions, results many of which have yet to appear in the mainstream geostatistical literature or application. These classes of functions extend the well known classes by adding a parameter to define the distance norm. In practice, this distance parameter can be set a priori to represent, for example, the Euclidean distance, or kept as a parameter to allow the data to choose the metric. A simulated application of the latter is provided for demonstration. Simulation results are also presented comparing kriged predictions based on Euclidean distance to those based on using a water metric.

104 citations
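
The validity issue raised above can be checked numerically: build the covariance matrix implied by a chosen covariance function and distance measure and inspect its smallest eigenvalue. The sketch below does this for a Gaussian covariance under Euclidean and city-block distances; the city-block metric stands in for the paper's water metric purely for illustration, and no claim is made here about which combination fails for a given configuration.

```python
# A minimal validity check of the kind the entry warns about: plugging a
# non-Euclidean distance into an existing covariance function gives no
# guarantee of positive definiteness.  A clearly negative smallest eigenvalue
# means the model/metric combination is invalid for these locations.
import numpy as np
from scipy.spatial.distance import cdist

def min_cov_eigenvalue(coords, metric, covariance):
    """Smallest eigenvalue of C_ij = covariance(d(s_i, s_j))."""
    D = cdist(coords, coords, metric=metric)
    C = covariance(D)
    return float(np.linalg.eigvalsh(C).min())

gaussian_cov = lambda d, r=1.0: np.exp(-(d / r) ** 2)   # Gaussian covariance model

rng = np.random.default_rng(2)
sites = rng.uniform(size=(40, 2))                        # hypothetical site locations
print("Euclidean  :", min_cov_eigenvalue(sites, "euclidean", gaussian_cov))
print("City-block :", min_cov_eigenvalue(sites, "cityblock", gaussian_cov))
```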


Proceedings Article
01 Dec 2007
TL;DR: The main result shows that under mild conditions, LS-SVM for binary-class classifications is equivalent to the hard margin SVM based on the well-known Mahalanobis distance measure.
Abstract: We study the relationship between Support Vector Machines (SVM) and Least Squares SVM (LS-SVM). Our main result shows that under mild conditions, LS-SVM for binary-class classifications is equivalent to the hard margin SVM based on the well-known Mahalanobis distance measure. We further study the asymptotics of the hard margin SVM when the data dimensionality tends to infinity with a fixed sample size. Using recently developed theory on the asymptotics of the distribution of the eigenvalues of the covariance matrix, we show that under mild conditions, the equivalence result holds for the traditional Euclidean distance measure. These equivalence results are further extended to the multi-class case. Experimental results confirm the presented theoretical analysis.

100 citations


Proceedings ArticleDOI
20 Jun 2007
TL;DR: This paper proposes a novel discriminant learning algorithm in correlation measure space, Correlation Discriminant Analysis (CDA), based on the definitions of within-class correlation and between-class correlation, and shows its advantage over alternative methods.
Abstract: Correlation is one of the most widely used similarity measures in machine learning, alongside the Euclidean and Mahalanobis distances. However, compared with the numerous discriminant learning algorithms proposed in distance metric spaces, very little work has been conducted on this topic using a correlation similarity measure. In this paper, we propose a novel discriminant learning algorithm in correlation measure space, Correlation Discriminant Analysis (CDA). In this framework, based on the definitions of within-class correlation and between-class correlation, the optimum transformation is sought to maximize the difference between them, which accords empirically with good classification performance. Under different cases of the transformation, different implementations of the algorithm are given. Extensive empirical evaluations of CDA demonstrate its advantage over alternative methods.

Journal ArticleDOI
TL;DR: A new texture-feature estimation technique for discriminating images of eight different grades of CTC (cutting, tearing, and curling) tea discriminates images of different-sized tea granules more efficiently than statistical feature vectors do.

Journal ArticleDOI
TL;DR: The new linear pixel-swapping method led to an increase in the accuracy of mapping fine linear features of approximately 5% compared with the conventional pixel-swapping method.

Journal ArticleDOI
TL;DR: This work investigates the use of different Machine Learning methods to construct models for aqueous solubility, evaluating all approaches in terms of their prediction accuracy and the extent to which the individual error bars can faithfully represent the actual prediction error.
Abstract: We investigate the use of different Machine Learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate method to obtain error bars, in order to estimate the domain of applicability (DOA) for each model. Here, we investigate error bars from a Bayesian model (Gaussian Process (GP)), an ensemble based approach (Random Forest), and approaches based on the Mahalanobis distance to training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation, and on an external validation set of 536 molecules) and in how far the individual error bars can faithfully represent the actual prediction error.
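
One of the domain-of-applicability ideas mentioned above can be sketched directly: score each test compound by the Mahalanobis distance of its descriptor vector from the training data, treating large distances as a warning that the prediction is extrapolating. The class name and descriptor data below are hypothetical, not the authors' code.

```python
# Illustrative sketch (not the paper's code): score how far a test compound
# lies from the training data by the Mahalanobis distance of its descriptor
# vector to the training distribution; compounds with a large distance fall
# outside the model's domain of applicability (DOA).
import numpy as np

class MahalanobisDOA:
    def fit(self, X_train):
        self.mean_ = X_train.mean(axis=0)
        # small ridge keeps the covariance invertible for correlated descriptors
        cov = np.cov(X_train, rowvar=False) + 1e-6 * np.eye(X_train.shape[1])
        self.cov_inv_ = np.linalg.inv(cov)
        return self

    def distance(self, X):
        d = X - self.mean_
        return np.sqrt(np.einsum('ij,jk,ik->i', d, self.cov_inv_, d))

# Hypothetical descriptor matrices for training and test compounds.
rng = np.random.default_rng(3)
doa = MahalanobisDOA().fit(rng.normal(size=(200, 8)))
test = rng.normal(size=(5, 8)) + np.array([0, 0, 0, 0, 0, 0, 0, 5.0])  # one shifted descriptor
print(doa.distance(test))   # larger values -> farther outside the training domain
```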

Journal ArticleDOI
TL;DR: This paper first finds the data structure for each class adaptively in the input space via agglomerative hierarchical clustering, and then construct the weighted Mahalanobis distance (WMD) kernels using the detected data distribution information.
Abstract: The support vector machine (SVM) has been demonstrated to be a very effective classifier in many applications, but its performance is still limited as the data distribution information is underutilized in determining the decision hyperplane. Most of the existing kernels employed in nonlinear SVMs measure the similarity between a pair of pattern images based on the Euclidean inner product or the Euclidean distance of corresponding input patterns, which ignores data distribution tendency and makes the SVM essentially a "local" classifier. In this paper, we provide a step toward a paradigm of kernels by incorporating data specific knowledge into existing kernels. We first find the data structure for each class adaptively in the input space via agglomerative hierarchical clustering (AHC), and then construct the weighted Mahalanobis distance (WMD) kernels using the detected data distribution information. In WMD kernels, the similarity between two pattern images is determined not only by the Mahalanobis distance (MD) between their corresponding input patterns but also by the sizes of the clusters they reside in. Although WMD kernels are not guaranteed to be positive definite (pd) or conditionally positive definite (cpd), satisfactory classification results can still be achieved because regularizers in SVMs with WMD kernels are empirically positive in pseudo-Euclidean (pE) spaces. Experimental results on both synthetic and real-world data sets show the effectiveness of "plugging" data structure into existing kernels.
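
The exact weighting used in the WMD kernels is not reproduced here; the hedged sketch below only illustrates the ingredient the abstract describes: detect cluster structure with agglomerative hierarchical clustering, estimate a covariance per cluster, and use a Mahalanobis distance derived from that structure inside an RBF-style kernel.

```python
# Not the paper's WMD kernel (its cluster-size weighting is omitted); only a
# sketch of the general ingredient: agglomerative clustering, per-cluster
# covariances, and a Mahalanobis distance inside an RBF-style similarity.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_covariances(X, n_clusters):
    labels = fcluster(linkage(X, method="ward"), t=n_clusters, criterion="maxclust")
    covs = {}
    for c in np.unique(labels):
        pts = X[labels == c]
        covs[c] = np.cov(pts, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    return labels, covs

def mahalanobis_rbf(x, y, cov):
    d = x - y
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

# Hypothetical two-cluster data.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels, covs = cluster_covariances(X, n_clusters=2)
# similarity of two points measured under the covariance of the first point's cluster
print(mahalanobis_rbf(X[0], X[1], covs[labels[0]]))
```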

Journal ArticleDOI
TL;DR: A content-based representation of a video shot composed of a background (still) mosaic and one or more foreground moving objects; segmentation of the moving objects is based on ego-motion compensation and on background modelling using tools from robust statistics.

Journal ArticleDOI
TL;DR: This paper considers tests of multinormality which are based on the Mahalanobis distance between two multivariate location vector estimates or on the (matrix) distance between two scatter matrix estimates, respectively.
Abstract: Classical univariate measures of asymmetry such as Pearson’s (mean-median)/σ or (mean-mode)/σ often measure the standardized distance between two separate location parameters and have been widely used in assessing univariate normality. Similarly, measures of univariate kurtosis are often just ratios of two scale measures. The classical standardized fourth moment and the ratio of the mean deviation to the standard deviation serve as examples. In this paper we consider tests of multinormality which are based on the Mahalanobis distance between two multivariate location vector estimates or on the (matrix) distance between two scatter matrix estimates, respectively. Asymptotic theory is developed to provide approximate null distributions as well as to consider asymptotic efficiencies. Limiting Pitman efficiencies for contiguous sequences of contaminated normal distributions are calculated and the efficiencies are compared to those of the classical tests by Mardia. Simulations are used to compare finite sample efficiencies. The theory is also illustrated by an example.
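
As a rough illustration of the kind of statistic described (not the paper's exact tests or their null distributions), the sketch below measures the Mahalanobis distance between two location estimates, the sample mean and the coordinatewise median, standardized by the sample covariance; under multivariate normality the two estimates nearly coincide.

```python
# Rough illustration only: a skewness-type statistic built as the Mahalanobis
# distance between two multivariate location estimates (sample mean vs.
# coordinatewise median), standardized by the sample covariance.
import numpy as np

def location_distance_statistic(X):
    mean = X.mean(axis=0)
    median = np.median(X, axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = mean - median
    return float(len(X) * d @ cov_inv @ d)   # scaled squared Mahalanobis distance

rng = np.random.default_rng(5)
print(location_distance_statistic(rng.normal(size=(500, 3))))       # small for symmetric data
print(location_distance_statistic(rng.exponential(size=(500, 3))))  # much larger for skewed data
```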

Journal ArticleDOI
TL;DR: This paper presents a wavelet-characteristic-based approach for the automated visual inspection of ripple defects in the surface barrier layer (SBL) chips of ceramic capacitors, and compares the defect detection performance of three wavelet-based multivariate statistical models.

Journal ArticleDOI
TL;DR: The results showed that core collections constructed by LDSS strategy had a good representativeness of the initial collection, and suggested that standardized Euclidean distance was an appropriate genetic distance for constructing core collections in this strategy.
Abstract: A strategy was proposed for constructing core collections by least distance stepwise sampling (LDSS) based on genotypic values. In each procedure of cluster, the sampling is performed in the subgroup with the least distance in the dendrogram during constructing a core collection. Mean difference percentage (MD), variance difference percentage (VD), coincidence rate of range (CR) and variable rate of coefficient of variation (VR) were used to evaluate the representativeness of core collections constructed by this strategy. A cotton germplasm collection of 1,547 accessions with 18 quantitative traits was used to construct core collections. Genotypic values of all quantitative traits of the cotton collection were unbiasedly predicted based on mixed linear model approach. By three sampling percentages (10, 20 and 30%), four genetic distances (city block distance, Euclidean distance, standardized Euclidean distance and Mahalanobis distance) combining four hierarchical cluster methods (nearest distance method, furthest distance method, unweighted pair-group average method and Ward’s method) were adopted to evaluate the property of this strategy. Simulations were conducted in order to draw consistent, stable and reproducible results. The principal components analysis was performed to validate this strategy. The results showed that core collections constructed by LDSS strategy had a good representativeness of the initial collection. As compared to the control strategy (stepwise clusters with random sampling strategy), LDSS strategy could construct more representative core collections. For LDSS strategy, cluster methods did not need to be considered because all hierarchical cluster methods could give same results completely. The results also suggested that standardized Euclidean distance was an appropriate genetic distance for constructing core collections in this strategy.

Journal ArticleDOI
TL;DR: This work partitions errors of imputation derived from similar observation units as arising from three sources: observation error, the distribution of observation units with respect to their similarity, and pure error given a particular choice of variables known for all observation units.
Abstract: Imputation is applied for two quite different purposes: to supply missing data to complete a data set for subsequent modeling analyses or to estimate subpopulation totals. Error properties of the imputed values have different effects in these two contexts. We partition errors of imputation derived from similar observation units as arising from three sources: observation error, the distribution of observation units with respect to their similarity, and pure error given a particular choice of variables known for all observation units. Two new statistics based on this partitioning measure the accuracy of the imputations, facilitating comparison of imputation to alternative methods of estimation such as regression and comparison of alternative methods of imputation generally. Knowing the relative magnitude of the errors arising from these partitions can also guide efficient investment in obtaining additional data. We illustrate this partitioning using three extensive data sets from western North America. Application of this partitioning to compare near-neighbor imputation is illustrated for Mahalanobis- and two canonical correlation-based measures of similarity.

Journal ArticleDOI
TL;DR: The forward search provides a series of robust parameter estimates based on increasing numbers of observations, which are used to cluster multivariate normal data and compare with mclust and k-means clustering.

Proceedings ArticleDOI
12 Nov 2007
TL;DR: A new in-vehicle real-time vehicle detection strategy which hypothesizes the presence of vehicles in rectangular sub-regions based on the robust classification of feature vectors resulting from a combination of multiple morphological vehicle features is presented.
Abstract: This paper presents a new in-vehicle real-time vehicle detection strategy which hypothesizes the presence of vehicles in rectangular sub-regions based on the robust classification of feature vectors resulting from a combination of multiple morphological vehicle features. One vector is extracted for each region of the image likely to contain vehicles, as a multidimensional likelihood measure with respect to a simplified vehicle model. A supervised training phase sets the representative vectors of the vehicle and non-vehicle classes, so that the hypothesis is verified or not according to the Mahalanobis distance between the feature vector and the representative vectors. Excellent results have been obtained in several video sequences, accurately detecting vehicles with very different aspect ratios, colors and sizes, while minimizing the number of missed detections and false alarms.
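
The verification step described above can be sketched as follows: each class is summarized by representative statistics from training, and a candidate region is accepted as a vehicle when its feature vector is closer, in Mahalanobis distance, to the vehicle class than to the non-vehicle class. The feature dimensions and training data below are hypothetical.

```python
# Hedged sketch of the verification step: accept a candidate region as a
# vehicle if its morphological feature vector is closer (in Mahalanobis
# distance) to the vehicle class statistics than to the non-vehicle ones.
import numpy as np

def class_stats(F):
    return F.mean(axis=0), np.linalg.inv(np.cov(F, rowvar=False))

def mahalanobis(x, mean, cov_inv):
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def is_vehicle(feature, vehicle_stats, background_stats):
    return mahalanobis(feature, *vehicle_stats) < mahalanobis(feature, *background_stats)

# Hypothetical training features (e.g., symmetry, edge density, aspect ratio, ...).
rng = np.random.default_rng(6)
vehicle_train = rng.normal(1.0, 0.2, size=(100, 4))
background_train = rng.normal(0.0, 0.5, size=(100, 4))
stats_v, stats_b = class_stats(vehicle_train), class_stats(background_train)
print(is_vehicle(rng.normal(1.0, 0.2, size=4), stats_v, stats_b))   # likely True
```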

Journal ArticleDOI
TL;DR: This work exposes a more severe form of vulnerability than a hill-climbing attack, in which incrementally different versions of the same face are used, and the ability of the proposed approach to reconstruct users' actual face templates raises privacy concerns in biometric systems.
Abstract: Regeneration of templates from match scores has security and privacy implications related to any biometric authentication system. We propose a novel paradigm to reconstruct face templates from match scores using a linear approach. It proceeds by first modeling the behavior of the given face recognition algorithm by an affine transformation. The goal of the modeling is to approximate the distances computed by a face recognition algorithm between two faces by distances between points, representing these faces, in an affine space. Given this space, templates from an independent image set (break-in) are matched only once with the enrolled template of the targeted subject and match scores are recorded. These scores are then used to embed the targeted subject in the approximating affine (nonorthogonal) space. Given the coordinates of the targeted subject in the affine space, the original template of the targeted subject is reconstructed using the inverse of the affine transformation. We demonstrate our ideas using three fundamentally different face recognition algorithms: principal component analysis (PCA) with Mahalanobis cosine distance measure, Bayesian intra-extrapersonal classifier (BIC), and a feature-based commercial algorithm. To demonstrate the independence of the break-in set with the gallery set, we select face templates from two different databases: the face recognition grand challenge (FRGC) database and the facial recognition technology (FERET) database. With an operational point set at 1 percent false acceptance rate (FAR) and 99 percent true acceptance rate (TAR) for 1,196 enrollments (FERET gallery), we show that at most 600 attempts (score computations) are required to achieve a 73 percent chance of breaking in as a randomly chosen target subject for the commercial face recognition system. With a similar operational setup, we achieve a 72 percent and 100 percent chance of breaking in for the Bayesian and PCA-based face recognition systems, respectively. With three different levels of score quantization, we achieve 69 percent, 68 percent, and 49 percent probability of break-in, indicating the robustness of our proposed scheme to score quantization. We also show that the proposed reconstruction scheme has 47 percent more probability of breaking in as a randomly chosen target subject for the commercial system as compared to a hill climbing approach with the same number of attempts. Given that the proposed template reconstruction method uses distinct face templates to reconstruct faces, this work exposes a more severe form of vulnerability than a hill climbing kind of attack where incrementally different versions of the same face are used. Also, the ability of the proposed approach to reconstruct the actual face templates of the users increases privacy concerns in biometric systems.
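
The embedding step described above, placing faces as points whose pairwise distances approximate the match-score distances, can be realized with classical multidimensional scaling. The sketch below shows that standard tool on synthetic distances; it is not the authors' exact affine modeling procedure.

```python
# Not the authors' procedure; a sketch of the standard tool behind the
# embedding step they describe: classical multidimensional scaling recovers
# point coordinates whose Euclidean distances approximate a given distance
# matrix (here, distances that would be derived from match scores).
import numpy as np

def classical_mds(D, dim):
    """Embed points from an n x n matrix of pairwise distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1][:dim]       # keep the largest eigenvalues
    return eigvec[:, order] * np.sqrt(np.clip(eigval[order], 0, None))

# Synthetic check: recover a 2-D configuration from its pairwise distances.
rng = np.random.default_rng(7)
true_points = rng.normal(size=(10, 2))
D = np.linalg.norm(true_points[:, None, :] - true_points[None, :, :], axis=-1)
embedded = classical_mds(D, dim=2)
# The embedding matches the true configuration up to rotation/reflection/translation,
# so the sorted pairwise distances agree.
D_emb = np.linalg.norm(embedded[:, None] - embedded[None, :], axis=-1)
print(np.allclose(np.sort(D.ravel()), np.sort(D_emb.ravel()), atol=1e-6))
```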

Journal Article
TL;DR: In this paper, the authors compare the ability of the Mahalanobis-Taguchi System and a neural-network to discriminate using small data sets, and examine the discriminant ability as a function of data set size.
Abstract: The Mahalanobis-Taguchi System is a diagnosis and predictive method for analyzing patterns in multivariate cases. The goal of this study is to compare the ability of the Mahalanobis-Taguchi System and a neural network to discriminate using small data sets. We examine the discriminant ability as a function of data set size using an application area where reliable data is publicly available. The study uses the Wisconsin Breast Cancer study with nine attributes and one class.
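
A simplified sketch of the Mahalanobis-Taguchi System's first stage follows: the normal group defines the Mahalanobis space, distances are scaled by the number of attributes so that healthy cases average about one, and abnormal cases appear as much larger distances. The Taguchi orthogonal-array screening of variables is omitted, and the data are hypothetical rather than the Wisconsin Breast Cancer set.

```python
# Simplified sketch of the MTS first stage (orthogonal-array variable
# screening omitted): the normal group defines the Mahalanobis space and the
# scaled squared Mahalanobis distance is the discriminant score.
import numpy as np

def mts_distance(X_normal, X_test):
    mu, sd = X_normal.mean(axis=0), X_normal.std(axis=0, ddof=1)
    Zn = (X_normal - mu) / sd                    # standardize with normal-group statistics
    corr_inv = np.linalg.inv(np.corrcoef(Zn, rowvar=False))
    Zt = (X_test - mu) / sd
    k = X_normal.shape[1]
    return np.einsum('ij,jk,ik->i', Zt, corr_inv, Zt) / k   # scaled squared MD

# Hypothetical data with nine attributes per case.
rng = np.random.default_rng(8)
normal = rng.normal(size=(200, 9))
abnormal = rng.normal(loc=2.0, size=(10, 9))
print(mts_distance(normal, normal[:5]).round(2))  # near 1 for normal cases
print(mts_distance(normal, abnormal).round(2))    # well above 1 for abnormal cases
```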

Journal ArticleDOI
TL;DR: In this paper, the authors examined the ability of the Mahalanobis-Taguchi system’s measurement scale to classify steel plates as “OK” or “Diverted”.

Journal ArticleDOI
29 Apr 2007
TL;DR: This paper presents an adaptive Multi-level Mahalanobis-based Dimensionality Reduction (MMDR) technique for high-dimensional indexing that not only achieves higher precision, but also enables queries to be processed efficiently.
Abstract: The notorious “dimensionality curse” is a well-known phenomenon for any multi-dimensional indexes attempting to scale up to high dimensions. One well-known approach to overcome degradation in performance with respect to increasing dimensions is to reduce the dimensionality of the original dataset before constructing the index. However, identifying the correlation among the dimensions and effectively reducing them are challenging tasks. In this paper, we present an adaptive Multi-level Mahalanobis-based Dimensionality Reduction (MMDR) technique for high-dimensional indexing. Our MMDR technique has four notable features compared to existing methods. First, it discovers elliptical clusters for more effective dimensionality reduction by using only the low-dimensional subspaces. Second, data points in the different axis systems are indexed using a single B+-tree. Third, our technique is highly scalable in terms of data size and dimension. Finally, it is also dynamic and adaptive to insertions. An extensive performance study was conducted using both real and synthetic datasets, and the results show that our technique not only achieves higher precision, but also enables queries to be processed efficiently.

Proceedings ArticleDOI
01 Sep 2007
TL;DR: The results show the feasibility of classification based on the multilead wavelet features, although further development is needed in subset selection and classification algorithms.
Abstract: The objective of this work is to develop a model for ECG classification based on multilead features. The MIT-BIH Arrhythmia database was used following AAMI recommendations and class labeling. We used for classification classical features as well as features extracted from different scales of the wavelet decomposition of both leads, integrated in an RMS manner. Step-wise and randomized methods were considered for feature subset selection, and linear discriminant analysis (LDA) was also used for additional dimensionality reduction. Three classifiers were evaluated, linear, quadratic and Mahalanobis distance, using a k-fold-like cross-validation scheme. Results in the training set showed that the best performance was obtained with a 28-feature subset, using LDA and a Mahalanobis distance classifier. This model was evaluated on the test dataset with the following performance measurements: global accuracy: 86%; for supraventricular beats, Sensitivity: 86%, Positive pred.: 20%; for ventricular beats, Sensitivity: 71%, Positive pred.: 61%. These results show the feasibility of classification based on the multilead wavelet features, although further development is needed in subset selection and classification algorithms.
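
A hedged sketch of the Mahalanobis distance classifier evaluated above (not the authors' code): after feature selection or LDA projection, each beat class is summarized by a mean and covariance, and a beat is assigned to the class whose mean is closest in Mahalanobis distance. The two-dimensional features below are hypothetical.

```python
# Hedged sketch of a Mahalanobis distance classifier: assign each sample to
# the class whose mean is nearest under that class's own covariance.
import numpy as np

class MahalanobisClassifier:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.stats_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
            self.stats_[c] = (Xc.mean(axis=0), np.linalg.inv(cov))
        return self

    def predict(self, X):
        dists = np.column_stack([
            np.einsum('ij,jk,ik->i', X - mu, cov_inv, X - mu)
            for mu, cov_inv in (self.stats_[c] for c in self.classes_)
        ])
        return self.classes_[np.argmin(dists, axis=1)]

# Hypothetical 2-D projected wavelet features for two beat classes.
rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
clf = MahalanobisClassifier().fit(X, y)
print(clf.predict(np.array([[0.2, -0.1], [2.8, 3.1]])))   # expected: [0 1]
```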

Journal ArticleDOI
TL;DR: A novel multivariate approach to evaluate the quality of an array that examines the 'Mahalanobis distance' of its quality attributes from those of other arrays is proposed; computing these distances on subsets of the quality measures in the report may increase the method's ability to detect unusual arrays and helps to identify possible reasons for the quality problems.
Abstract: Motivation: The process of producing microarray data involves multiple steps, some of which may suffer from technical problems and seriously damage the quality of the data. Thus, it is essential to identify those arrays with low quality. This article addresses two questions: (1) how to assess the quality of a microarray dataset using the measures provided in quality control (QC) reports; (2) how to identify possible sources of the quality problems. Results: We propose a novel multivariate approach to evaluate the quality of an array that examines the ‘Mahalanobis distance’ of its quality attributes from those of other arrays. Thus, we call it Mahalanobis Distance Quality Control (MDQC) and examine different approaches of this method. MDQC flags problematic arrays based on the idea of outlier detection, i.e. it flags those arrays whose quality attributes jointly depart from those of the bulk of the data. Using two case studies, we show that a multivariate analysis gives substantially richer information than analyzing each parameter of the QC report in isolation. Moreover, once the QC report is produced, our quality assessment method is computationally inexpensive and the results can be easily visualized and interpreted. Finally, we show that computing these distances on subsets of the quality measures in the report may increase the method’s ability to detect unusual arrays and helps to identify possible reasons of the quality problems. Availability: The library to implement MDQC will soon be available from Bioconductor Contact: gcohen@mrl.ubc.ca Supplementary information: Supplementary data are available at Bioinformatics online.
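
The core of MDQC can be sketched compactly (this is not the Bioconductor package): treat each array's vector of QC measures as one observation, compute its Mahalanobis distance from the bulk of the arrays, and flag distances that are extreme under a chi-square reference. The robust estimators and per-subset distances of the method are omitted, and the QC data below are simulated.

```python
# Compact sketch of the MDQC idea (not the Bioconductor implementation):
# multivariate outlier detection on the QC-report measures, flagging arrays
# whose Mahalanobis distance from the bulk exceeds a chi-square cutoff.
import numpy as np
from scipy.stats import chi2

def mdqc_flags(qc_measures, alpha=0.01):
    """qc_measures: (n_arrays, n_quality_attributes). Returns a boolean flag per array."""
    mu = qc_measures.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(qc_measures, rowvar=False))
    d = qc_measures - mu
    md2 = np.einsum('ij,jk,ik->i', d, cov_inv, d)
    cutoff = chi2.ppf(1 - alpha, df=qc_measures.shape[1])
    return md2 > cutoff

# Simulated QC report: 50 arrays, 6 quality measures, with one corrupted array.
rng = np.random.default_rng(10)
qc = rng.normal(size=(50, 6))
qc[7] += 6.0                         # one array with jointly aberrant quality measures
print(np.where(mdqc_flags(qc))[0])   # expected to include array 7
```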

Journal ArticleDOI
TL;DR: In this article, principal component analysis (PCA) and Fisher discriminant analysis (FDA) are presented to detect and diagnose the single sensor fault with fixed bias occurring in variable air volume systems.
Abstract: Principal component analysis (PCA) and Fisher discriminant analysis (FDA) are presented in this paper to detect and diagnose the single sensor fault with fixed bias occurring in variable air volume systems. Based on the energy balance and the flow-pressure balance, both related to physical models of the systems, two PCA models are built to detect the occurrence of abnormalities in the systems. In addition, FDA, a linear dimensionality reduction technique, is developed to diagnose the fault source. Through the Fisher transformation, different faulty operation data classes can be optimally separated by maximizing the scatter between classes while minimizing the scatter within classes. Then the faulty sensor can be isolated through comparing Mahalanobis distances of the candidate sensors.

Proceedings ArticleDOI
20 Jun 2007
TL;DR: The suggested similarity function exhibits superior performance over alternative Mahalanobis distances learnt from the same data, and is demonstrated in the context of image retrieval and graph based clustering, using a large number of data sets.
Abstract: We consider the problem of learning a similarity function from a set of positive equivalence constraints, i.e. 'similar' point pairs. We define the similarity in information theoretic terms, as the gain in coding length when shifting from independent encoding of the pair to joint encoding. Under simple Gaussian assumptions, this formulation leads to a non-Mahalanobis similarity function which is efficient and simple to learn. This function can be viewed as a likelihood ratio test, and we show that the optimal similarity-preserving projection of the data is a variant of Fisher Linear Discriminant. We also show that under some naturally occurring sampling conditions of equivalence constraints, this function converges to a known Mahalanobis distance (RCA). The suggested similarity function exhibits superior performance over alternative Mahalanobis distances learnt from the same data. Its superiority is demonstrated in the context of image retrieval and graph based clustering, using a large number of data sets.