
Showing papers on "Linear discriminant analysis published in 2017"


Journal ArticleDOI
TL;DR: A solid intuition is built for what LDA is and how it works, enabling readers of all levels to better understand LDA and to know how to apply the technique in different applications.
Abstract: Linear Discriminant Analysis (LDA) is a very common technique for dimensionality reduction problems as a preprocessing step for machine learning and pattern classification applications. At the same time, it is usually used as a black box and (sometimes) not well understood. The aim of this paper is to build a solid intuition for what LDA is and how it works, enabling readers of all levels to get a better understanding of LDA and to know how to apply the technique in different applications. The paper first gives the basic definitions and steps of how the LDA technique works, supported with visual explanations of these steps. Moreover, the two methods of computing the LDA space, i.e. the class-dependent and class-independent methods, are explained in detail. Then, in a step-by-step approach, two numerical examples demonstrate how the LDA space can be calculated for the class-dependent and class-independent methods. Furthermore, two of the most common LDA problems (i.e. the Small Sample Size (SSS) and non-linearity problems) are highlighted and illustrated, and state-of-the-art solutions to these problems are investigated and explained. Finally, a number of experiments are conducted with different datasets to (1) investigate the effect of the eigenvectors used in the LDA space on the robustness of the extracted features and on classification accuracy, and (2) show when the SSS problem occurs and how it can be addressed.
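The class-independent computation the paper walks through (within- and between-class scatter matrices, then an eigendecomposition) can be sketched roughly as follows; this is a generic illustration on synthetic data, not the paper's own code:

```python
import numpy as np

def lda_fit(X, y, n_components=1):
    """Class-independent LDA: eigenvectors of pinv(Sw) @ Sb."""
    classes = np.unique(y)
    mean_total = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_total).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)
    # pinv tolerates a singular Sw (the small-sample-size situation)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]  # largest eigenvalues first
    return eigvecs[:, order[:n_components]].real

rng = np.random.default_rng(0)
X0 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))  # class 0
X1 = rng.normal([3.0, 3.0], 0.5, size=(50, 2))  # class 1
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

W = lda_fit(X, y)
proj = X @ W  # 1-D projection separating the two classes
```

The class-dependent variant the paper describes would instead compute a separate transformation per class from that class's own within-class scatter.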

518 citations


Journal ArticleDOI
TL;DR: This study tests machine learning models to predict bankruptcy one year prior to the event, and finds that bagging, boosting, and random forest models outperform the other techniques, and that prediction accuracy in the testing sample improves when the additional variables are included.
Abstract: Machine learning models show improved bankruptcy prediction accuracy over traditional models. Various models were tested using different accuracy metrics. Boosting, bagging, and random forest models provide better results. There has been intensive research from academics and practitioners regarding models for predicting bankruptcy and default events, for credit risk management. Seminal academic research has evaluated bankruptcy using traditional statistics techniques (e.g. discriminant analysis and logistic regression) and early artificial intelligence models (e.g. artificial neural networks). In this study, we test machine learning models (support vector machines, bagging, boosting, and random forest) to predict bankruptcy one year prior to the event, and compare their performance with results from discriminant analysis, logistic regression, and neural networks. We use data from 1985 to 2013 on North American firms, integrating information from the Salomon Center database and Compustat, analysing more than 10,000 firm-year observations. The key insight of the study is a substantial improvement in prediction accuracy using machine learning techniques, especially when, in addition to the original Altman's Z-score variables, we include six complementary financial indicators. Based on Carton and Hofer (2006), we use new variables, such as the operating margin, change in return-on-equity, change in price-to-book, and growth measures related to assets, sales, and number of employees, as predictive variables. Machine learning models show, on average, approximately 10% higher accuracy than traditional models. Comparing the best models with all predictive variables, the machine learning technique based on random forest led to 87% accuracy, whereas logistic regression and linear discriminant analysis led to 69% and 50% accuracy, respectively, in the testing sample.
We find that bagging, boosting, and random forest models outperform the other techniques, and that prediction accuracy in the testing sample improves when the additional variables are included. Our research adds to the continuing debate about the superiority of computational methods over statistical techniques, such as in Tsai, Hsu, and Yen (2014) and Yeh, Chi, and Lin (2014). In particular, among the machine learning mechanisms, we do not find SVM to lead to higher accuracy rates than other models. This result contradicts outcomes from Danenas and Garsva (2015) and Cleofas-Sanchez, Garcia, Marques, and Sanchez (2016), but corroborates, for instance, Wang, Ma, and Yang (2014), Liang, Lu, Tsai, and Shih (2016), and Cano et al. (2017). Our study supports the applicability of expert systems by practitioners, as in Heo and Yang (2014), Kim, Kang, and Kim (2015) and Xiao, Xiao, and Wang (2016).
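The model comparison described above can be mimicked in miniature with scikit-learn; the data here are a synthetic stand-in for firm-year financial indicators (the study's Salomon Center/Compustat sample is not public), so the accuracies are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with bankruptcy as the minority class (10%)
X, y = make_classification(n_samples=2000, n_features=12, n_informative=6,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)

models = {
    "random forest": RandomForestClassifier(random_state=42),
    "logistic regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}
# Fit each model on the training split, score on the held-out test split
accs = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
        for name, m in models.items()}
```

With the strong class imbalance the paper's setting implies, accuracy alone can be misleading; the study's use of several accuracy metrics is the safer practice.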

430 citations


Book ChapterDOI
01 Jan 2017
TL;DR: This chapter describes a solution that applies a linear transformation to source features to align them with target features before classifier training, and proposes to equivalently apply CORAL to the classifier weights, leading to added efficiency when the number of classifiers is small but the number and dimensionality of target examples are very high.
Abstract: In this chapter, we present CORrelation ALignment (CORAL), a simple yet effective method for unsupervised domain adaptation. CORAL minimizes domain shift by aligning the second-order statistics of source and target distributions, without requiring any target labels. In contrast to subspace manifold methods, it aligns the original feature distributions of the source and target domains, rather than the bases of lower-dimensional subspaces. It is also much simpler than other distribution matching methods. CORAL performs remarkably well in extensive evaluations on standard benchmark datasets. We first describe a solution that applies a linear transformation to source features to align them with target features before classifier training. For linear classifiers, we propose to equivalently apply CORAL to the classifier weights, leading to added efficiency when the number of classifiers is small but the number and dimensionality of target examples are very high. The resulting CORAL Linear Discriminant Analysis (CORAL-LDA) outperforms LDA by a large margin on standard domain adaptation benchmarks. Finally, we extend CORAL to learn a nonlinear transformation that aligns correlations of layer activations in deep neural networks (DNNs). The resulting Deep CORAL approach works seamlessly with DNNs and achieves state-of-the-art performance on standard benchmark datasets. Our code is available at: https://github.com/VisionLearningGroup/CORAL.
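The linear CORAL transformation (whiten the source covariance, then re-color with the target covariance) is simple enough to sketch in a few lines; this is a paraphrase of the published recipe on synthetic data, with a ridge term `lam` added for numerical stability:

```python
import numpy as np

def _sqrtm(C, inv=False):
    """Matrix square root (or inverse square root) of an SPD matrix."""
    w, V = np.linalg.eigh(C)
    w = np.maximum(w, 1e-12)
    return (V * w ** (-0.5 if inv else 0.5)) @ V.T

def coral(Xs, Xt, lam=1.0):
    """Whiten source features, then re-color with the target covariance."""
    Cs = np.cov(Xs, rowvar=False) + lam * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + lam * np.eye(Xt.shape[1])
    return Xs @ _sqrtm(Cs, inv=True) @ _sqrtm(Ct)

rng = np.random.default_rng(1)
Xs = rng.normal(size=(500, 3)) @ np.diag([1.0, 2.0, 0.5])  # source domain
Xt = rng.normal(size=(500, 3)) @ np.diag([0.5, 1.0, 3.0])  # target domain
Xs_aligned = coral(Xs, Xt)

# Distance between source and target covariances, before and after alignment
gap_before = np.linalg.norm(np.cov(Xs, rowvar=False) - np.cov(Xt, rowvar=False))
gap_after = np.linalg.norm(np.cov(Xs_aligned, rowvar=False) - np.cov(Xt, rowvar=False))
```

A classifier trained on `Xs_aligned` then sees source features whose second-order statistics resemble the target domain, which is the whole of the (label-free) adaptation step.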

271 citations


Journal ArticleDOI
TL;DR: Zhang et al. propose a hybrid deep architecture combining Fisher vectors and deep neural networks to learn non-linear transformations of pedestrian images into a deep space where the data are linearly separable.

203 citations


Journal ArticleDOI
TL;DR: Experimental results indicate that the SBLFB method is promising for development of an effective classifier to improve MI classification.
Abstract: Effective common spatial pattern (CSP) feature extraction for motor imagery (MI) electroencephalogram (EEG) recordings usually depends on the filter band selection to a large extent. Subband optimization has been suggested to enhance classification accuracy of MI. Accordingly, this study introduces a new method that implements sparse Bayesian learning of frequency bands (named SBLFB) from EEG for MI classification. CSP features are extracted on a set of signals that are generated by a filter bank with multiple overlapping subbands from raw EEG data. Sparse Bayesian learning is then exploited to implement selection of significant features with a linear discriminant criterion for classification. The effectiveness of SBLFB is demonstrated on the BCI Competition IV IIb dataset, in comparison with several other competing methods. Experimental results indicate that the SBLFB method is promising for development of an effective classifier to improve MI classification.
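The CSP step that SBLFB applies per sub-band can be illustrated as a generalized eigenvalue problem on the two classes' average channel covariances; the toy "EEG" below is synthetic, and the band-pass filter bank and sparse Bayesian selection stages are omitted:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_pairs=1):
    """CSP spatial filters from two classes of multichannel trials.
    trials_*: arrays of shape (n_trials, n_channels, n_samples)."""
    def avg_cov(trials):
        return np.mean([np.cov(t) for t in trials], axis=0)
    Ca, Cb = avg_cov(trials_a), avg_cov(trials_b)
    # Generalized eigenproblem Ca w = lambda (Ca + Cb) w; the extreme
    # eigenvalues give filters whose output variance discriminates best.
    eigvals, W = eigh(Ca, Ca + Cb)
    idx = np.argsort(eigvals)
    sel = np.r_[idx[:n_pairs], idx[-n_pairs:]]
    return W[:, sel]

rng = np.random.default_rng(2)
# Toy 3-channel "EEG": class A is strong on channel 0, class B on channel 1
trials_a = rng.normal(size=(20, 3, 200)) * np.array([3.0, 1.0, 1.0])[:, None]
trials_b = rng.normal(size=(20, 3, 200)) * np.array([1.0, 3.0, 1.0])[:, None]
W = csp_filters(trials_a, trials_b)

# Variance of the filtered signal is the classification feature
f = W[:, -1]  # filter emphasizing class A
va = np.mean([np.var(f @ t) for t in trials_a])
vb = np.mean([np.var(f @ t) for t in trials_b])
```

In the paper's pipeline this computation is repeated once per overlapping sub-band, and the resulting log-variance features are what the sparse Bayesian learner then prunes.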

185 citations


Journal ArticleDOI
TL;DR: A Gaussian process (GP)-based classification technique using three kernels, namely linear, polynomial, and radial basis kernels, is adapted and investigated in comparison to existing techniques such as LDA, QDA, and NB.

164 citations


Journal ArticleDOI
TL;DR: The results show that the proposed method substantially improves the emotion recognition rate with respect to the commonly used spectral power band method.
Abstract: Recent advancements in human–computer interaction research have led to the possibility of emotional communication via brain–computer interface systems for patients with neuropsychiatric disorders or disabilities. In this paper, we efficiently recognize emotional states by analyzing the features of electroencephalography (EEG) signals, which are generated from EEG sensors that noninvasively measure the electrical activity of neurons inside the human brain, and select the optimal combination of these features for recognition. In this study, the scalp EEG data of 21 healthy subjects (12–14 years old) were recorded using a 14-channel EEG machine while the subjects watched images with four types of emotional stimuli (happy, calm, sad, or scared). After preprocessing, the Hjorth parameters (activity, mobility, and complexity) were used to measure the signal activity of the time series data. We selected the optimal EEG features using a balanced one-way ANOVA after calculating the Hjorth parameters for different frequency ranges. Features selected by this statistical method outperformed univariate and multivariate features. The optimal features were further processed for emotion classification using support vector machine, k-nearest neighbor, linear discriminant analysis, Naive Bayes, random forest, deep learning, and four ensemble methods (bagging, boosting, stacking, and voting). The results show that the proposed method substantially improves the emotion recognition rate with respect to the commonly used spectral power band method.
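The Hjorth parameters used here have closed-form definitions in terms of the signal's variance and the variances of its first and second differences, so they are easy to compute directly; the signals below are synthetic, not the study's EEG recordings:

```python
import numpy as np

def hjorth(x):
    """Hjorth parameters of a 1-D signal: activity, mobility, complexity."""
    dx = np.diff(x)
    ddx = np.diff(dx)
    activity = np.var(x)                                  # signal power
    mobility = np.sqrt(np.var(dx) / np.var(x))            # mean frequency proxy
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility  # shape change
    return activity, mobility, complexity

# A pure sine is maximally "simple"; adding noise raises mobility/complexity
t = np.linspace(0, 1, 1000, endpoint=False)
clean = np.sin(2 * np.pi * 10 * t)
noisy = clean + 0.5 * np.random.default_rng(3).normal(size=t.size)
a_c, m_c, c_c = hjorth(clean)
a_n, m_n, c_n = hjorth(noisy)
```

In the study these three values are computed per frequency band and then filtered through the ANOVA-based selection before classification.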

156 citations


Journal ArticleDOI
01 Aug 2017
TL;DR: Experimental results on the international public Bonn epilepsy EEG dataset show that the average classification accuracy of the presented approach is equal to or higher than 98.10% in all five cases, indicating the effectiveness of the proposed approach for automated seizure detection.
Abstract: Achieving the goal of detecting seizure activity automatically using electroencephalogram (EEG) signals is of great importance and significance for the treatment of epileptic seizures. To realize this aim, a newly-developed time-frequency analytical algorithm, namely local mean decomposition (LMD), is employed in the presented study. LMD is able to decompose an arbitrary signal into a series of product functions (PFs). First, the raw EEG signal is decomposed into several PFs, and then the temporal statistical and non-linear features of the first five PFs are calculated. The features of each PF are fed into five classifiers, including back propagation neural network (BPNN), K-nearest neighbor (KNN), linear discriminant analysis (LDA), un-optimized support vector machine (SVM) and SVM optimized by genetic algorithm (GA-SVM), for five classification cases, respectively. Confluent features of all PFs and the raw EEG are further passed into the high-performance GA-SVM for the same classification tasks. Experimental results on the international public Bonn epilepsy EEG dataset show that the average classification accuracy of the presented approach is equal to or higher than 98.10% in all five cases, indicating the effectiveness of the proposed approach for automated seizure detection.

152 citations


Journal ArticleDOI
TL;DR: A novel dimensionality reduction algorithm, locality adaptive discriminant analysis (LADA) for HSI classification that aims to learn a representative subspace of data, and focuses on the data points with close relationship in spectral and spatial domains.
Abstract: Linear discriminant analysis (LDA) is a popular technique for supervised dimensionality reduction, but with less concern about local data structure. This makes LDA inapplicable to many real-world situations, such as hyperspectral image (HSI) classification. In this letter, we propose a novel dimensionality reduction algorithm, locality adaptive discriminant analysis (LADA), for HSI classification. The proposed algorithm aims to learn a representative subspace of the data, and focuses on data points with close relationships in the spectral and spatial domains. An intuitive motivation is that data points of the same class have similar spectral features, and data points within a spatial neighborhood are usually associated with the same class. Compared with traditional LDA and its variants, LADA is able to adaptively exploit the local manifold structure of the data. Experiments carried out on several real hyperspectral data sets demonstrate the effectiveness of the proposed method.

145 citations


Journal ArticleDOI
TL;DR: A new criterion is proposed to maximize the weighted harmonic mean of trace ratios, which effectively avoids the domination problem without complicating the formulation, and consistently outperforms the other compared methods on all of the datasets.
Abstract: Linear discriminant analysis (LDA) is one of the most important supervised linear dimensionality reduction techniques. It seeks to learn a low-dimensional representation of the original high-dimensional feature space through a transformation matrix, while preserving discriminative information by maximizing the between-class scatter and minimizing the within-class scatter. However, conventional LDA is formulated to maximize the arithmetic mean of trace ratios, which suffers from the domination of the largest objectives and might deteriorate recognition accuracy in practical applications with a large number of classes. In this paper, we propose a new criterion that maximizes the weighted harmonic mean of trace ratios, which effectively avoids the domination problem without raising any difficulties in the formulation. An efficient algorithm with fast convergence is developed to solve the resulting challenging problem; it can find the globally optimal solution using only an eigenvalue decomposition in each iteration. Finally, we conduct extensive experiments to illustrate the effectiveness and superiority of our method on both synthetic and real-life datasets for various tasks, including face recognition, human motion recognition and head pose recognition. The experimental results indicate that our algorithm consistently outperforms the other compared methods on all of the datasets.
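The intuition for swapping the arithmetic mean for a harmonic mean can be seen with a toy set of trace-ratio values (the numbers below are made up for illustration, not from the paper): the arithmetic mean stays high even when one class pair is badly separated, while the harmonic mean is pulled toward the worst ratio, so maximizing it forces the projection to attend to the hardest pair:

```python
import numpy as np

# Toy per-class-pair trace ratios: two easy pairs and one poorly separated pair
ratios = np.array([0.9, 0.8, 0.05])

arith = ratios.mean()                      # stays high despite the hard pair
harm = ratios.size / np.sum(1.0 / ratios)  # dragged down by the hard pair
```

Because `harm` is close to the minimum ratio, an optimizer maximizing the (weighted) harmonic mean cannot "buy" objective value by further improving the already easy pairs, which is the domination problem the paper targets.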

138 citations


Journal ArticleDOI
TL;DR: Time domain features, namely waveform length (WL), slope sign changes (SSC), simple sign integral, and Wilson amplitude, are attempted for the first time, in addition to the established mean absolute value and zero crossing (ZC), for identification of mechanical faults of induction motors.
Abstract: The frequency of rolling element failures in induction motors is high and may lead to losses due to sudden machine downtime. Researchers are keen to identify an effective fault-diagnosis scheme with a low computational burden using an optimum number of well-discriminating features. We attempt time domain features, namely waveform length (WL), slope sign changes (SSC), simple sign integral, and Wilson amplitude (WAMP), for the first time, in addition to the established mean absolute value and zero crossing (ZC), for identification of mechanical faults of induction motors. Ten data sets are derived from the publicly available vibration database of Case Western Reserve University to identify the capability of the features in identifying faults under various conditions. The results are compared with six conventional features for tenfold cross validation using linear discriminant analysis, naive Bayes, and support vector machine classifiers. The results show that WL, WAMP, ZC, and SSC outperform the other features. Furthermore, area under receiver operating characteristic curve analyses showed an average of 0.9987 with the proposed statistical features and 0.97618 with the six conventional features. We also studied the effect of data length and percentage of overlap on classification, and found that accuracy improves with increasing length, but not significantly beyond a window length of 3000 with 50% overlap. The proposed statistical features are validated using the brute force method and the Laplacian method of feature selection, showing average accuracy rates of 0.9936 and 0.9894, respectively.
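The time-domain features named above (WL, ZC, SSC, WAMP, MAV) are all one-liners over a windowed signal. A possible numpy sketch follows; the threshold convention is one common choice, not the paper's exact parameterization:

```python
import numpy as np

def td_features(x, thresh=0.01):
    """Time-domain features commonly used in EMG/vibration classification.
    `thresh` suppresses noise-induced counts; its value here is an
    illustrative choice, not the paper's setting."""
    dx = np.diff(x)
    wl = np.sum(np.abs(dx))                                    # waveform length
    zc = np.sum((x[:-1] * x[1:] < 0) & (np.abs(dx) > thresh))  # zero crossings
    ssc = np.sum(dx[:-1] * dx[1:] < 0)                         # slope sign changes
    wamp = np.sum(np.abs(dx) > thresh)                         # Wilson amplitude
    mav = np.mean(np.abs(x))                                   # mean absolute value
    return {"WL": wl, "ZC": zc, "SSC": ssc, "WAMP": wamp, "MAV": mav}

# Sanity check on a 5 Hz unit sine sampled at 1 kHz for 1 s:
# WL ~ 4 * amplitude * cycles = 20, and MAV ~ 2/pi for a sine
t = np.linspace(0, 1, 1000, endpoint=False)
feats = td_features(np.sin(2 * np.pi * 5 * t))
```

In a fault-diagnosis pipeline like the one described, these features would be computed per overlapping window (e.g. length 3000, 50% overlap) before being fed to the classifier.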

Journal ArticleDOI
01 Feb 2017
TL;DR: A novel two-tier classification model is proposed, based on the machine learning approaches Naive Bayes and a certainty-factor voting version of KNN, with Linear Discriminant Analysis for dimension reduction; it provides low computation time and a good detection rate against rare and complex attack types.
Abstract: Network anomaly detection is one of the most challenging fields in cyber security. Most of the proposed techniques have high computational complexity or are based on heuristic approaches. This paper proposes a novel two-tier classification model based on the machine learning approaches Naive Bayes and a certainty-factor voting version of KNN, with Linear Discriminant Analysis for dimension reduction. Experimental results show a desirable and promising gain in detection rate and false alarm rate compared with other existing models. The model was also trained on two balanced training sets generated using the SMOTE method, to evaluate the chosen similarity measure for dealing with imbalanced network anomaly data sets. The two-tier model provides low computation time due to optimal dimension reduction and feature selection, as well as a good detection rate against rare and complex attack types, which are especially dangerous because of their close similarity to normal behaviors, such as User to Root and Remote to Local. All evaluations were performed on the NSL-KDD data set.

Journal ArticleDOI
TL;DR: A two-layer learning method for driving behavior recognition using EEG data that shows a significant correlation between EEG patterns and car-following behavior is proposed.

Journal ArticleDOI
TL;DR: ‘new age’ classifiers in corporate bankruptcy modelling are recommended because they predict significantly better than all other classifiers on both the cross-sectional and longitudinal test samples and the models may have considerable practical appeal.
Abstract: Corporate bankruptcy prediction has attracted significant research attention from business academics, regulators and financial economists over the past five decades (Altman, 2002). However, much of this literature has relied on quite simplistic classifiers such as logistic regression and linear discriminant analysis (LDA) (Jones and Hensher, 2008). Based on a large sample of US corporate bankruptcies, we examine the predictive performance of 16 classifiers, ranging from the most restrictive classifiers (such as logit, probit and linear discriminant analysis) to more advanced techniques such as neural networks, support vector machines (SVMs) and ‘new age’ statistical learning models including generalised boosting, AdaBoost and random forests. Consistent with the findings of Jones et al. (2015), we show that quite simple classifiers such as logit and LDA perform reasonably well in bankruptcy prediction. However, we recommend the use of ‘new age’ classifiers in corporate bankruptcy modelling because: (1) they predict significantly better than all other classifiers on both the cross-sectional and longitudinal test samples; (2) the models may have considerable practical appeal because they are relatively easy to estimate and implement (for instance, they require minimal researcher intervention for data preparation, variable selection and model architecture specification); and (3) while the underlying model structures can be very complex, we demonstrate that ‘new age’ classifiers have a reasonably good level of interpretability through such metrics as relative variable importances (RVIs). This article is protected by copyright. All rights reserved

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work has attempted to design a generalized model for recognition of three fundamental movements of the human forearm performed in daily life where data is collected from four different subjects using a single wrist worn accelerometer sensor.
Abstract: In recent years, significant advancements have taken place in human activity recognition using various machine learning approaches. However, feature engineering has dominated conventional methods, involving the difficult process of optimal feature selection. This problem has been mitigated by using a novel methodology based on a deep learning framework which automatically extracts the useful features and reduces the computational cost. As a proof of concept, we have attempted to design a generalized model for recognition of three fundamental movements of the human forearm performed in daily life, where data are collected from four different subjects using a single wrist-worn accelerometer sensor. The validation of the proposed model is done under different pre-processing and noisy data conditions, evaluated using three possible methods. The results show that our proposed methodology achieves an average recognition rate of 99.8%, as opposed to conventional methods based on K-means clustering, linear discriminant analysis and support vector machine.

Journal ArticleDOI
27 May 2017-Sensors
TL;DR: The thorough quantitative comparison of the features and classifiers in this study supports the feasibility of a wireless, wearable sEMG sensor system for automatic activity monitoring and fall detection.
Abstract: As an essential subfield of context awareness, activity awareness, especially daily activity monitoring and fall detection, plays a significant role for elderly or frail people who need assistance in their daily activities. This study investigates the feature extraction and pattern recognition of surface electromyography (sEMG), with the purpose of determining the best features and classifiers of sEMG for daily living activities monitoring and fall detection. This is done through a series of experiments. In the experiments, four channels of sEMG signal from wireless, wearable sensors located on the lower limbs are recorded from three subjects while they perform seven activities of daily living (ADL). A simulated trip fall scenario is also considered, with a custom-made device attached to the ankle. With this experimental setting, 15 feature extraction methods of sEMG, including time, frequency, time/frequency domain and entropy, are analyzed based on class separability and calculation complexity, and five classification methods, each with 15 features, are evaluated with respect to recognition accuracy and calculation complexity for activity monitoring and fall detection. It is shown that high recognition accuracy and minimal calculation time for daily activity monitoring and fall detection can be achieved in the current experimental setting. Specifically, the Wilson Amplitude (WAMP) feature performs the best, and the classifier Gaussian Kernel Support Vector Machine (GK-SVM) with Permutation Entropy (PE) or WAMP results in the highest accuracy for activity monitoring, with recognition rates of 97.35% and 96.43%.
For fall detection, the classifier Fuzzy Min-Max Neural Network (FMMNN) has the best sensitivity and specificity at the cost of the longest calculation time, while the classifier Gaussian Kernel Fisher Linear Discriminant Analysis (GK-FDA) with the feature WAMP guarantees a high sensitivity (98.70%) and specificity (98.59%) with a short calculation time (65.586 ms), making it a possible choice for pre-impact fall detection. The thorough quantitative comparison of the features and classifiers in this study supports the feasibility of a wireless, wearable sEMG sensor system for automatic activity monitoring and fall detection.

Journal ArticleDOI
TL;DR: This paper proposes a non-greedy iterative algorithm to solve the trace ratio form of L1-norm-based linear discriminant analysis and demonstrates that the proposed algorithm can maximize the objective function value and is superior to most existing L 1-LDA algorithms.
Abstract: Recently, L1-norm-based discriminant subspace learning has attracted much attention in dimensionality reduction and machine learning. However, most existing approaches solve the column vectors of the optimal projection matrix one by one with a greedy strategy. Thus, the obtained projection matrix does not necessarily best optimize the corresponding trace ratio objective function, which is the essential criterion function for general supervised dimensionality reduction. In this paper, we propose a non-greedy iterative algorithm to solve the trace ratio form of L1-norm-based linear discriminant analysis. We analyze the convergence of our proposed algorithm in detail. Extensive experiments on five popular image databases illustrate that our proposed algorithm can maximize the objective function value and is superior to most existing L1-LDA algorithms.

Journal ArticleDOI
TL;DR: In this paper, a new feature extraction step that combines the classical wavelet packet decomposition energy distribution technique and a feature extraction technique based on the selection of the most impulsive frequency bands is presented.

Journal ArticleDOI
TL;DR: A patient-specific epileptic seizure predication method relying on the common spatial pattern- (CSP-) based feature extraction of scalp electroencephalogram (sEEG) signals to train a linear discriminant analysis classifier, which is then employed in the testing phase.
Abstract: This paper presents a patient-specific epileptic seizure predication method relying on the common spatial pattern- (CSP-) based feature extraction of scalp electroencephalogram (sEEG) signals. Multichannel EEG signals are traced and segmented into overlapping segments for both preictal and interictal intervals. The features extracted using CSP are used for training a linear discriminant analysis classifier, which is then employed in the testing phase. A leave-one-out cross-validation strategy is adopted in the experiments. The experimental results for seizure prediction obtained from the records of 24 patients from the CHB-MIT database reveal that the proposed predictor can achieve an average sensitivity of 0.89, an average false prediction rate of 0.39, and an average prediction time of 68.71 minutes using a 120-minute prediction horizon.
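The leave-one-out protocol with an LDA classifier is straightforward to reproduce in scikit-learn; the features below are synthetic stand-ins for the per-segment CSP feature vectors (preictal vs. interictal), since the CHB-MIT recordings are not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-ins for CSP feature vectors from labeled EEG segments
X, y = make_classification(n_samples=40, n_features=6, n_informative=4,
                           random_state=0)

# One fold per sample: train on all but one segment, test on the held-out one
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
mean_acc = scores.mean()
```

Each fold scores a single held-out example, so `scores` has one 0/1 entry per sample and its mean is the leave-one-out accuracy.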

Journal ArticleDOI
TL;DR: The comparison pointed out that, for people with trans-radial amputation, the algorithm that produces the best compromise is NLR, closely followed by MLP; this result was also confirmed by the comparison with LDA with time domain features, which showed no significant differences in performance or computational burden between NLR and LDA.
Abstract: Currently, the typically adopted hand prosthesis surface electromyography (sEMG) control strategies do not provide users with a natural control feeling and do not exploit the full potential of commercially available multi-fingered hand prostheses. Pattern recognition and machine learning techniques applied to sEMG can be effective for natural control based on the residual muscle contractions of amputated people corresponding to phantom limb movements. As research has reached an advanced degree of accuracy, these algorithms have been validated, and embedding them is necessary for the realization of prosthetic devices. The aim of this work is to provide engineering tools and indications on how to choose the most suitable classifier, and its specific internal settings, for embedded control of multigrip hand prostheses. By means of an innovative statistical analysis, we compare 4 different classifiers: Nonlinear Logistic Regression (NLR), Multi-Layer Perceptron (MLP), Support Vector Machine (SVM) and Linear Discriminant Analysis (LDA), which was considered as ground truth. Experimental tests were performed on sEMG data collected from 30 people with trans-radial amputation, in which the algorithms were evaluated for both performance and computational burden; the statistical analysis was based on the Wilcoxon Signed-Rank test, with statistical significance considered at p < 0.05. The comparative analysis among NLR, MLP and SVM shows that, for both classification performance and the number of classification parameters, SVM attains the highest values, followed by MLP, and then by NLR. However, using as the sole constraint on the maximum acceptable complexity of each classifier the typically available memory of a high-performance microcontroller, the comparison pointed out that for people with trans-radial amputation the algorithm that produces the best compromise is NLR, closely followed by MLP.
This result was also confirmed by the comparison with LDA with time domain features, which showed no significant differences in performance or computational burden between NLR and LDA. The proposed analysis provides innovative engineering tools and indications on how to choose the most suitable classifier based on the application and the desired results for prosthesis control.

Journal ArticleDOI
TL;DR: Introducing a wide sub-band and using mutual information for selecting the most discriminative sub-bands, the proposed method shows improvement in motor imagery EEG signal classification.
Abstract: Common spatial pattern (CSP) has been an effective technique for feature extraction in electroencephalography (EEG) based brain computer interfaces (BCIs). However, motor imagery EEG signal feature extraction using CSP generally depends to a great extent on the selection of the frequency bands. In this study, we propose a mutual information based frequency band selection approach. The idea of the proposed method is to utilize the information from all the available channels for effectively selecting the most discriminative filter banks. CSP features are extracted from multiple overlapping sub-bands. An additional sub-band has been introduced that covers the wide frequency band (7–30 Hz), and two different types of features are extracted using CSP and common spatio-spectral pattern techniques, respectively. Mutual information is then computed from the extracted features of each of these bands, and the top filter banks are selected for further processing. Linear discriminant analysis is applied to the features extracted from each of the filter banks. The scores are fused together, and classification is done using support vector machine. The proposed method is evaluated using BCI Competition III dataset IVa, BCI Competition IV dataset I and BCI Competition IV dataset IIb, and it outperformed all other competing methods, achieving the lowest misclassification rate and the highest kappa coefficient on all three datasets. By introducing a wide sub-band and using mutual information for selecting the most discriminative sub-bands, the proposed method shows improvement in motor imagery EEG signal classification.
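The band-selection idea (score each sub-band's features by mutual information with the class label, keep the top bands) can be sketched with scikit-learn's estimator; the "sub-band features" here are synthetic, with only one band made class-dependent:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(4)
n_trials = 200
y = rng.integers(0, 2, n_trials)

# Toy log-variance features from four sub-bands; only band 2 carries class info
features = rng.normal(size=(n_trials, 4))
features[:, 2] += 2.0 * y

# Mutual information between each band's feature and the class label
mi = mutual_info_classif(features, y, random_state=0)
best_band = int(np.argmax(mi))
```

In the paper's pipeline, the bands ranked highest by this score are the ones whose LDA scores are fused and passed to the SVM.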

Journal ArticleDOI
TL;DR: In this article, a two-step approach was proposed to predict formation permeability for a sandstone reservoir in the reservoir formations of Hassi R’Mel Field Southern from well log data using multivariate methods.

Journal ArticleDOI
TL;DR: A machine learning based solution to classify a sample as benign or malware with high accuracy and low computation overhead is proposed and empirical evidence indicates 98.4% classification accuracy in the 10-fold cross validation for the proposed integrated feature set.

Journal ArticleDOI
TL;DR: In this paper, the authors compared the accuracy of two approaches to predicting bank failure, traditional statistical techniques and machine learning techniques, and found that the artificial neural network and k-nearest neighbor methods are the most accurate.
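As a rough numpy sketch of one of the two winning methods (purely illustrative; the synthetic clusters below stand in for the paper's bank financial ratios), a k-nearest-neighbour classifier reduces to a distance computation and a majority vote:

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k=3):
    """Plain k-nearest-neighbour majority vote over Euclidean distances."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)  # squared distances
    nn = np.argsort(d, axis=1)[:, :k]                       # k closest training rows
    votes = ytr[nn]
    return (votes.mean(1) >= 0.5).astype(int)

rng = np.random.default_rng(4)
# hypothetical 2-D feature clusters: label 1 = failed bank, label 0 = solvent
failed = rng.normal(-1, 0.5, (40, 2))
solvent = rng.normal(1, 0.5, (40, 2))
Xtr = np.vstack([failed, solvent])
ytr = np.array([1] * 40 + [0] * 40)
pred = knn_predict(Xtr, ytr, np.array([[-1.0, -1.0], [1.0, 1.0]]))
```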

Journal ArticleDOI
TL;DR: The R package assignPOP, which uses Monte-Carlo and K-fold cross-validation procedures, as well as principal component analysis (PCA), to estimate assignment accuracy and membership probabilities, can benefit any researcher who seeks to use genetic or non-genetic data to infer population structure and membership of individuals.
Abstract: Summary 1. The use of biomarkers (e.g., genetic, microchemical, and morphometric characteristics) to discriminate among and assign individuals to a population can benefit species conservation and management by facilitating our ability to understand population structure and demography. 2. Tools that can evaluate the reliability of large genomic datasets for population discrimination and assignment, as well as allow their integration with non-genetic markers for the same purpose, are lacking. Our R package, assignPOP, provides both functions in a supervised machine-learning framework. 3. assignPOP uses Monte-Carlo and K-fold cross-validation procedures, as well as principal component analysis (PCA), to estimate assignment accuracy and membership probabilities, using training (i.e., baseline source population) and test (i.e., validation) datasets that are independent. A user can then build a specified predictive model based on the relative sizes of these datasets and classification functions, including linear discriminant analysis, support vector machine, naive Bayes, decision tree, and random forest. 4. assignPOP can benefit any researcher who seeks to use genetic or non-genetic data to infer population structure and membership of individuals. assignPOP is a freely available R package under the GPL license, and can be downloaded from CRAN or at https://github.com/alexkychen/assignPOP. A comprehensive tutorial can also be found at https://alexkychen.github.io/assignPOP/.
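The K-fold evaluation loop can be illustrated in a few lines of numpy (assignPOP itself is an R package; this is a hypothetical Python analogue using a PCA + nearest-class-centroid assigner on synthetic data, not the package's code):

```python
import numpy as np

def kfold_assignment_accuracy(X, y, k=5, n_components=2, seed=0):
    """K-fold CV of a PCA + nearest-centroid assigner -- an illustration of
    the evaluation loop, with PCA refit on each training fold to avoid leakage."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    correct = 0
    for f in folds:
        test = np.zeros(len(y), bool)
        test[f] = True
        Xtr, ytr = X[~test], y[~test]
        mu = Xtr.mean(0)
        _, _, Vt = np.linalg.svd(Xtr - mu, full_matrices=False)
        P = Vt[:n_components].T                       # PCA projection
        Ztr, Zte = (Xtr - mu) @ P, (X[test] - mu) @ P
        cents = np.array([Ztr[ytr == c].mean(0) for c in np.unique(ytr)])
        pred = np.unique(ytr)[np.argmin(
            ((Zte[:, None, :] - cents[None]) ** 2).sum(-1), axis=1)]
        correct += (pred == y[test]).sum()
    return correct / len(y)

rng = np.random.default_rng(1)
# two hypothetical source populations with shifted means
X = np.vstack([rng.normal(0, 1, (60, 4)), rng.normal(3, 1, (60, 4))])
y = np.array([0] * 60 + [1] * 60)
acc = kfold_assignment_accuracy(X, y)
```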

Journal ArticleDOI
TL;DR: Experiments on six benchmark UCI datasets and two artificial datasets demonstrate that the proposed FDAF-score algorithm not only obtains good results with fewer features than the original datasets and fast computation, but also handles classification problems with noise well.
Abstract: A feature ranking method is discussed based on Fisher discriminant analysis (FDA) and F-score. The relative distribution of different classes is considered in the paper. The method removes all insignificant features at a time, so it can effectively reduce computational cost. The advantages of the proposed method are discussed. F-score is a simple feature selection technique; however, it works only for two classes. This paper proposes a novel feature ranking method based on Fisher discriminant analysis (FDA) and F-score, denoted as FDAF-score, which considers the relative distribution of classes in a multi-dimensional feature space. The main idea is that a proper subset is obtained by maximizing the ratio of the average between-class distance to the relative within-class scatter. Because the method removes all insignificant features at a time, it can effectively reduce computational cost. Experiments on six benchmark UCI datasets and two artificial datasets demonstrate that the proposed FDAF-score algorithm not only obtains good results with fewer features than the original datasets and fast computation, but also handles classification problems with noise well.
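The criterion can be sketched per feature as the ratio of between-class scatter to within-class scatter (a simplified reading of the FDAF-score idea on synthetic data, not the paper's implementation):

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-feature ratio of between-class scatter to within-class scatter --
    a Fisher-style score in the spirit of the FDAF-score criterion."""
    classes = np.unique(y)
    overall = X.mean(0)
    between = sum((y == c).mean() * (X[y == c].mean(0) - overall) ** 2
                  for c in classes)
    within = sum((y == c).mean() * X[y == c].var(0) for c in classes)
    return between / within

rng = np.random.default_rng(2)
y = np.repeat([0, 1, 2], 50)                 # three classes, as in multi-class FDA
good = y + 0.3 * rng.standard_normal(150)    # separates all three classes
junk = rng.standard_normal(150)              # uninformative feature
X = np.column_stack([good, junk])
scores = fisher_ratio(X, y)
keep = np.argsort(scores)[::-1][:1]          # drop low-scoring features in one pass
```

Discarding every feature below a score threshold in a single pass is what lets this family of methods avoid the cost of iterative wrapper-style selection.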

Journal ArticleDOI
TL;DR: The predictive model developed in this study shows promise as a classification tool for stratifying bladder cancer into two staging categories: greater than or equal to stage T2 and belowStage T2.
Abstract: Purpose To evaluate the feasibility of using an objective computer aided system to assess bladder cancer stage in CT Urography (CTU). Materials and Methods A data set consisting of 84 bladder cancer lesions from 76 CTU cases was used to develop the computerized system for bladder cancer staging based on machine learning approaches. The cases were grouped into two classes based on pathological stage ≥T2 or below T2, which is the decision threshold for neoadjuvant chemotherapy treatment clinically. There were 43 cancers below stage T2 and 41 cancers at stage T2 or above. All 84 lesions were automatically segmented using our previously developed auto-initialized cascaded level sets (AI-CALS) method. Morphological and texture features were extracted. The features were divided into subspaces of morphological features only, texture features only, and a combined set of both morphological and texture features. The data set was split into Set 1 and Set 2 for two-fold cross validation. Stepwise feature selection was used to select the most effective features. A linear discriminant analysis (LDA), a neural network (NN), a support vector machine (SVM), and a random forest (RAF) classifier were used to combine the features into a single score. The classification accuracy of the four classifiers was compared using the area under the receiver operating characteristic (ROC) curve (Az). Results Based on the texture features only, the LDA classifier achieved a test Az of 0.91 on Set 1 and a test Az of 0.88 on Set 2. The test Az of the NN classifier for Set 1 and Set 2 were 0.89 and 0.92, respectively. The SVM classifier achieved a test Az of 0.91 on Set 1 and a test Az of 0.89 on Set 2. The test Az of the RAF classifier for Set 1 and Set 2 were 0.89 and 0.97, respectively. The morphological features alone, the texture features alone, and the combined feature set achieved comparable classification performance.
Conclusion The predictive model developed in this study shows promise as a classification tool for stratifying bladder cancer into two staging categories: greater than or equal to stage T2 and below stage T2.
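The Az statistic used to compare the classifiers is the area under the ROC curve, which can be computed directly from the rank (Mann-Whitney) identity; the classifier scores below are hypothetical:

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank identity:
    the fraction of (positive, negative) pairs ranked correctly, ties half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

labels = np.array([0, 0, 0, 1, 1, 1])           # 1 = stage >= T2 (hypothetical)
good_scores = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 0.7])
a = auc(good_scores, labels)                     # perfectly ranked scores give 1.0
```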

Journal ArticleDOI
TL;DR: An improved spatial-spectral segmentation approach for the analysis of hyperspectral imaging data and its application for the prediction of powdery mildew infection levels (disease severity) of intact Chardonnay grape bunches shortly before veraison is presented.
Abstract: Hyperspectral imaging is an emerging means of assessing plant vitality, stress parameters, nutrition status, and diseases. Extraction of target values from the high-dimensional datasets either relies on pixel-wise processing of the full spectral information, appropriate selection of individual bands, or calculation of spectral indices. Limitations of such approaches are reduced classification accuracy, reduced robustness due to spatial variation of the spectral information across the surface of the objects measured as well as a loss of information intrinsic to band selection and use of spectral indices. In this paper we present an improved spatial-spectral segmentation approach for the analysis of hyperspectral imaging data and its application for the prediction of powdery mildew infection levels (disease severity) of intact Chardonnay grape bunches shortly before veraison. Instead of calculating texture features (spatial features) for the huge number of spectral bands independently, dimensionality reduction by means of Linear Discriminant Analysis (LDA) was applied first to derive a few descriptive image bands. Subsequent classification was based on modified Random Forest classifiers and selective extraction of texture parameters from the integral image representation of the image bands generated. Dimensionality reduction, integral images, and the selective feature extraction led to improved classification accuracies of up to 0.998 ± 0.003 for detached berries used as a reference sample (training dataset). Our approach was validated by predicting infection levels for a sample of 30 intact bunches. Classification accuracy improved with the number of decision trees of the Random Forest classifier. These results corresponded with qPCR results. An accuracy of 0.87 was achieved in classification of healthy, infected, and severely diseased bunches.
However, discrimination between visually healthy and infected bunches proved challenging for a few samples, perhaps due to colonized berries or sparse mycelia hidden within the bunch, or airborne conidia on the berries that were detected by qPCR. An advanced approach to hyperspectral image classification based on combined spatial and spectral image features, potentially applicable to many available hyperspectral sensor technologies, has been developed and validated to improve the detection of powdery mildew infection levels of Chardonnay grape bunches. The spatial-spectral approach especially improved the detection of light infection levels compared with pixel-wise spectral data analysis. This approach is expected to improve the speed and accuracy of disease detection once the thresholds for fungal biomass detected by hyperspectral imaging are established; it can also facilitate monitoring in plant phenotyping of grapevine and additional crops.
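The LDA dimensionality-reduction step can be sketched with plain numpy (a generic Fisher LDA on synthetic three-band "spectra"; the band values and class means are hypothetical, not the study's data):

```python
import numpy as np

def lda_projection(X, y, n_components=1):
    """Fisher LDA projection from within/between class scatter matrices --
    a compact sketch of reducing many spectral bands to a few image bands."""
    classes = np.unique(y)
    mean = X.mean(0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * diff @ diff.T
    # regularise Sw slightly in case some bands are collinear
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:n_components]]

rng = np.random.default_rng(3)
healthy = rng.normal([0, 0, 0], 0.5, (80, 3))    # hypothetical reflectance bands
infected = rng.normal([2, 1, 0], 0.5, (80, 3))
X = np.vstack([healthy, infected])
y = np.array([0] * 80 + [1] * 80)
W = lda_projection(X, y)
z = X @ W                                        # one discriminative "image band"
sep = abs(z[y == 0].mean() - z[y == 1].mean()) / z.std()
```

Texture features would then be computed on the few projected bands rather than on every raw spectral band, which is what makes the subsequent Random Forest stage tractable.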

Journal ArticleDOI
TL;DR: The study demonstrates a fully novel segmentation model embedded with risk assessment; the combination of SVM and FDR, used as the optimal pRAS system, yielded a classification accuracy of 99.84% under a cross-validation protocol.

Journal ArticleDOI
TL;DR: The obtained results show that the multi-class data hyperplane using LDA with a threefold SVM approach is effective and simple for quadratic data analysis.