Topic

Linear discriminant analysis

About: Linear discriminant analysis is a research topic. Over the lifetime, 18,361 publications have been published within this topic, receiving 603,195 citations. The topic is also known as LDA.


Papers
Journal ArticleDOI
TL;DR: The power analysis algorithm calculates the appropriate sample size for discrimination of phenotypic subtypes in a reduced dimensional space obtained by Fisher discriminant analysis (FDA), and it was confirmed that when the minimum number of samples estimated from power analysis is used, group means in the FDA discrimination space are statistically different.
Abstract: Motivation: Transcriptional profiling using microarrays can reveal important information about cellular and tissue expression phenotypes, but these measurements are costly and time consuming. Additionally, tissue sample availability poses further constraints on the number of arrays that can be analyzed in connection with a particular disease or state of interest. It is therefore important to provide a method for the determination of the minimum number of microarrays required to separate, with statistical reliability, distinct disease states or other physiological differences. Results: Power analysis was applied to estimate the minimum sample size required for two-class and multi-class discrimination. The power analysis algorithm calculates the appropriate sample size for discrimination of phenotypic subtypes in a reduced dimensional space obtained by Fisher discriminant analysis (FDA). This approach was tested by applying the algorithm to existing data sets for estimation of the minimum sample size required for drawing certain conclusions on multi-class distinction with statistical reliability. It was confirmed that when the minimum number of samples estimated from power analysis is used, group means in the FDA discrimination space are statistically different.

142 citations
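The general idea can be illustrated with a short, hypothetical sketch (not the paper's algorithm): project two toy classes onto the first Fisher discriminant axis with scikit-learn's LinearDiscriminantAnalysis, estimate the effect size of the group separation in that space, and solve a standard two-sample t-test power calculation for the minimum per-group sample size. The toy data, the 80% power target and the 5% significance level are illustrative assumptions.

```python
# Illustrative sketch only, not the paper's power analysis algorithm:
# estimate the minimum per-group sample size needed to separate two classes
# along the first Fisher discriminant axis, via a two-sample t-test power calculation.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(0)
# Toy "expression" data: two classes, 10 genes, modest mean shift.
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 10)),
               rng.normal(0.5, 1.0, size=(20, 10))])
y = np.array([0] * 20 + [1] * 20)

# Fisher discriminant projection to one dimension.
lda = LinearDiscriminantAnalysis(n_components=1)
z = lda.fit_transform(X, y).ravel()

# Effect size (Cohen's d) of the class means in the discriminant space.
z0, z1 = z[y == 0], z[y == 1]
pooled_sd = np.sqrt((z0.var(ddof=1) + z1.var(ddof=1)) / 2)
d = abs(z0.mean() - z1.mean()) / pooled_sd

# Minimum samples per group for 80% power at alpha = 0.05.
n_min = TTestIndPower().solve_power(effect_size=d, power=0.8, alpha=0.05)
print(f"effect size d = {d:.2f}, minimum samples per group ~ {int(np.ceil(n_min))}")
```

Note that estimating the effect size from the same data used to fit the projection is optimistically biased; the paper's algorithm also covers the multi-class case, which this sketch does not.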

Journal ArticleDOI
TL;DR: Different substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, are analysed in terms of their influence on unsupervised and supervised learning and, thus, their impact on the final output(s) of the biological interpretation.
Abstract: Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values cover about 10%–20% of all data and can originate from various sources: analytical, computational, as well as biological. Currently, the most widely used substitute for missing values is mean imputation. In fact, some researchers consider this aspect of data analysis in their metabolomics pipeline so routine that they do not even mention using this replacement approach. However, it may have a significant influence on the output(s) of the data analysis and may be highly sensitive to the distribution of samples between different classes. Therefore, in this study we have analysed different substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, in terms of their influence on unsupervised and supervised learning and, thus, their impact on the final output(s) in terms of biological interpretation. These comparisons have been demonstrated both visually and computationally (classification rate) to support our findings. The results show that the choice of method used to impute missing values may have a considerable effect on classification accuracy; if performed incorrectly, it may negatively influence the biomarkers selected for early disease diagnosis or the identification of cancer-related metabolites. For the GC-MS metabolomics data studied here, our findings recommend that RF be favoured for imputing missing values over the other tested methods. This approach displayed excellent classification rates for both supervised methods, namely principal components-linear discriminant analysis (PC-LDA) (98.02%) and partial least squares-discriminant analysis (PLS-DA) (97.96%), outperforming the other imputation methods.

142 citations
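A minimal sketch of this kind of comparison, assuming scikit-learn and synthetic data in place of the authors' GC-MS measurements: each imputer (zero, mean, median, kNN, and a random-forest-based iterative imputer standing in for RF imputation) is placed in front of a PCA + LDA classifier and scored by cross-validation. The missingness rate, data set and pipeline settings are illustrative assumptions, not the paper's protocol.

```python
# Illustrative sketch (not the authors' pipeline): compare missing-value
# imputation strategies by the cross-validated accuracy of a PCA + LDA classifier
# on synthetic data with values removed at random.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=200, n_features=30, n_informative=10, random_state=1)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan  # ~15% missing, as is typical for GC-MS data

imputers = {
    "zero":   SimpleImputer(strategy="constant", fill_value=0.0),
    "mean":   SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "kNN":    KNNImputer(n_neighbors=5),
    # Random-forest-based iterative imputation, in the spirit of missForest.
    "RF":     IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=1),
                               max_iter=5, random_state=1),
}

for name, imputer in imputers.items():
    clf = make_pipeline(imputer, PCA(n_components=10), LinearDiscriminantAnalysis())
    acc = cross_val_score(clf, X_missing, y, cv=5).mean()
    print(f"{name:>6}: mean CV accuracy = {acc:.3f}")
```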

Journal ArticleDOI
TL;DR: The combination of classifiers leads to a substantial reduction of the misclassification error in a wide range of applications and benchmark problems, and the procedure performs comparably to the best classifiers in a number of artificial examples and applications.

141 citations
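As a toy illustration of the general idea of combining classifiers (the paper's own combination procedure is not reproduced here), the sketch below compares a simple majority vote over LDA, k-nearest neighbours and a decision tree against each base classifier alone; the data set and base learners are arbitrary choices for the example.

```python
# Hypothetical sketch: majority-vote combination of three classifiers
# versus each base classifier on its own.
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
base = [("lda", LinearDiscriminantAnalysis()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(random_state=0))]

for name, clf in base + [("vote", VotingClassifier(estimators=base))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:>4}: mean CV accuracy = {acc:.3f}")
```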

Journal ArticleDOI
TL;DR: A number of methods have been proposed in the last decade to overcome the small-sample-size limitation of LDA; when applied to face recognition, these methods can be roughly grouped into three categories.

141 citations
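One widely cited family of remedies, the PCA + LDA ("Fisherfaces") route, can be sketched as below. This illustrates only that single category, not the paper's full taxonomy, and the Olivetti faces data (downloaded by scikit-learn) and the PCA dimensionality are assumptions chosen for the example.

```python
# Sketch of one common remedy for the small-sample-size problem in LDA:
# reduce dimensionality with PCA first, so the within-class scatter matrix
# is no longer singular, then apply LDA in the reduced space.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

faces = fetch_olivetti_faces()          # 400 images of 40 people, 4096 pixels each
X, y = faces.data, faces.target         # n_features >> n_samples: the small-sample-size setting

fisherfaces = make_pipeline(
    PCA(n_components=100, whiten=True),  # keep n_components <= n_samples - n_classes
    LinearDiscriminantAnalysis(),
)
print("CV accuracy:", cross_val_score(fisherfaces, X, y, cv=5).mean())
```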

Journal ArticleDOI
TL;DR: This work develops functional principal components analysis for this situation and demonstrates the prediction of individual trajectories from sparse observations; the method can handle missing data and leads to predictions of the functional principal component scores, which serve as random effects in this model.
Abstract: In longitudinal data analysis one frequently encounters non-Gaussian data that are repeatedly collected for a sample of individuals over time. The repeated observations could be binomial, Poisson or of another discrete type, or could be continuous. The timings of the repeated measurements are often sparse and irregular. We introduce a latent Gaussian process model for such data, establishing a connection to functional data analysis. The functional methods proposed are non-parametric and computationally straightforward as they do not involve a likelihood. We develop functional principal components analysis for this situation and demonstrate the prediction of individual trajectories from sparse observations. This method can handle missing data and leads to predictions of the functional principal component scores, which serve as random effects in this model. These scores can then be used for further statistical analysis, such as inference, regression, discriminant analysis or clustering. We illustrate these non-parametric methods with longitudinal data on primary biliary cirrhosis and show in simulations that they are competitive in comparisons with generalized estimating equations and generalized linear mixed models.

141 citations
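A heavily simplified sketch of the downstream idea, assuming densely and regularly observed curves (the paper's latent Gaussian process treatment of sparse, irregular, non-Gaussian observations is not reproduced): estimate functional principal component scores by ordinary PCA on discretised curves and use the scores as inputs to a discriminant analysis. All simulation settings are assumptions made for the example.

```python
# Greatly simplified sketch: "functional" PC scores via PCA on curves sampled
# on a common grid, then classification of subjects from those scores with LDA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 50)                       # common observation grid
n_per_group = 60

def simulate(shift):
    # Random curves: sine mean (shifted per group) + two random harmonics + noise.
    coef = rng.normal(size=(n_per_group, 2)) * [1.0, 0.5]
    basis = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
    return shift + np.sin(np.pi * t) + coef @ basis + rng.normal(0, 0.2, (n_per_group, t.size))

X = np.vstack([simulate(0.0), simulate(0.4)])
y = np.array([0] * n_per_group + [1] * n_per_group)

fpca = PCA(n_components=3)
scores = fpca.fit_transform(X)                   # principal component scores per subject
print("classification on FPC scores:",
      cross_val_score(LinearDiscriminantAnalysis(), scores, y, cv=5).mean())
```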


Network Information
Related Topics (5)

Regression analysis: 31K papers, 1.7M citations (85% related)
Artificial neural network: 207K papers, 4.5M citations (80% related)
Feature extraction: 111.8K papers, 2.1M citations (80% related)
Cluster analysis: 146.5K papers, 2.9M citations (79% related)
Image segmentation: 79.6K papers, 1.8M citations (79% related)
Performance Metrics

Number of papers in the topic in previous years:

Year    Papers
2025    1
2024    2
2023    756
2022    1,711
2021    678
2020    815