Author

Zena M. Hira

Bio: Zena M. Hira is an academic researcher from Imperial College London. The author has contributed to research in topics: Feature extraction & Support vector machine. The author has an h-index of 3 and has co-authored 3 publications receiving 567 citations.

Papers
Journal ArticleDOI
TL;DR: Various ways of performing dimensionality reduction on high-dimensional microarray data are summarised to provide a clearer idea of when to use each one of them for saving computational time and resources.
Abstract: We summarise various ways of performing dimensionality reduction on high-dimensional microarray data. Many different feature selection and feature extraction methods exist and are widely used. All these methods aim to remove redundant and irrelevant features so that classification of new instances will be more accurate. A popular source of data is microarrays, a biological platform for gathering gene expressions. Analysing microarrays can be difficult due to the size of the data they provide. In addition, the complicated relations among the different genes make analysis more difficult, and removing excess features can improve the quality of the results. We present some of the most popular methods for selecting significant features and provide a comparison between them. Their advantages and disadvantages are outlined in order to provide a clearer idea of when to use each one of them for saving computational time and resources.
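The core distinction the survey draws can be illustrated in a few lines: feature selection keeps a subset of the original gene probes, while feature extraction builds new composite features. A minimal sketch on synthetic "microarray-like" data (the dataset and dimensions are invented for illustration, not taken from the paper):

```python
# Sketch: feature selection vs. feature extraction on synthetic
# high-dimensional, low-sample data (the regime typical of microarrays).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# 60 samples x 500 "gene probes": many features, few samples.
X, y = make_classification(n_samples=60, n_features=500,
                           n_informative=20, random_state=0)

# Feature selection: rank probes with a univariate ANOVA F-test
# and keep the top 20 original probes.
selected = SelectKBest(f_classif, k=20).fit_transform(X, y)

# Feature extraction: project onto 20 principal components,
# each a linear combination of all probes.
extracted = PCA(n_components=20).fit_transform(X)

print(selected.shape, extracted.shape)  # both (60, 20)
```

Both reduce the data to 20 columns, but selected features remain interpretable as individual genes, while extracted components mix information from every probe — one of the trade-offs the survey compares.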

749 citations

Journal ArticleDOI
03 Mar 2014-PLOS ONE
TL;DR: This work has proposed a priori manifold learning for finding a manifold in which a representative set of microarray data is fused with relevant data taken from the KEGG pathway database and found that using this new manifold method gives better classification results than using either PCA or conventional Isomap.
Abstract: Microarray databases are a large source of genetic data, which, upon proper analysis, could enhance our understanding of biology and medicine. Many microarray experiments have been designed to investigate the genetic mechanisms of cancer, and analytical approaches have been applied in order to classify different types of cancer or distinguish between cancerous and non-cancerous tissue. However, microarrays are high-dimensional datasets with high levels of noise, and this causes problems when using machine learning methods. A popular approach to this problem is to search for a set of features that will simplify the structure and to some degree remove the noise from the data. The most widely used approach to feature extraction is principal component analysis (PCA), which assumes a multivariate Gaussian model of the data. More recently, non-linear methods have been investigated. Among these, manifold learning algorithms, for example Isomap, aim to project the data from a higher-dimensional space onto a lower-dimensional one. We have proposed a priori manifold learning for finding a manifold in which a representative set of microarray data is fused with relevant data taken from the KEGG pathway database. Once the manifold has been constructed, the raw microarray data is projected onto it and clustering and classification can take place. In contrast to earlier fusion-based methods, the prior knowledge from the KEGG databases is not used in, and does not bias, the classification process; it merely acts as an aid to find the best space in which to search the data. In our experiments we have found that using our new manifold method gives better classification results than using either PCA or conventional Isomap.
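For context, the conventional Isomap baseline the authors compare against can be sketched as follows. This is not the paper's KEGG-fused variant — their contribution biases the embedding step with pathway knowledge — just the standard embed-then-classify workflow, on invented synthetic data:

```python
# Sketch of the conventional Isomap baseline (not the authors' method):
# learn a low-dimensional manifold, project the data onto it, classify.
from sklearn.datasets import make_classification
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=200,
                           n_informative=10, random_state=0)

# Isomap: build a k-nearest-neighbour graph, approximate geodesic
# distances along it, then embed into 5 dimensions.
embedding = Isomap(n_neighbors=10, n_components=5).fit_transform(X)

# Classification happens in the embedded space; in the paper, prior
# knowledge shapes the embedding above, not this classifier.
scores = cross_val_score(KNeighborsClassifier(3), embedding, y, cv=5)
print(embedding.shape, scores.mean())
```

The a priori variant would alter the neighbourhood graph using KEGG pathway relations before embedding, leaving the downstream classifier untouched.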

8 citations

Journal ArticleDOI
TL;DR: A way of utilizing prior knowledge to segment microarray datasets in such a way that machine learning can be used to identify candidate sets of genes for hypothesis testing is investigated.
Abstract: In order to provide the most effective therapy for cancer, it is important to be able to diagnose whether a patient's cancer will respond to a proposed treatment. Methylation profiling could contain information from which such predictions could be made. Currently, hypothesis testing is used to determine whether possible biomarkers for cancer progression produce statistically significant results. However, this approach requires the identification of individual genes, or sets of genes, as candidate hypotheses, and with the increasing size of modern microarrays, this task is becoming progressively harder. Exhaustive testing of small sets of genes is computationally infeasible, and so hypothesis generation depends either on the use of established biological knowledge or on heuristic methods. As an alternative, machine learning methods can be used to identify groups of genes that are acting together within sets of cancer data and associate their behaviors with cancer progression. These methods have the advantage of being multivariate and unbiased but unfortunately also rapidly become computationally infeasible as the number of gene probes and datasets increases. To address this problem, we have investigated a way of utilizing prior knowledge to segment microarray datasets in such a way that machine learning can be used to identify candidate sets of genes for hypothesis testing. A methylation dataset is divided into subsets, where each subset contains only the probes that relate to a known gene pathway. Each of these pathway subsets is used independently for classification. The classification method is AdaBoost with decision trees as weak classifiers. Since each pathway subset contains a relatively small number of gene probes, it is possible to train and test its classification accuracy quickly and determine whether it has valuable diagnostic information. Finally, genes from successful pathway subsets can be combined to create a classifier of high accuracy.
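The per-pathway classification step described above can be sketched roughly as below. The data and the pathway's probe indices are invented for illustration; the paper uses real methylation probes grouped by known gene pathways:

```python
# Hedged sketch of one pathway-subset classification: slice out the
# probes belonging to a single pathway and score AdaBoost (whose
# default weak learner is a depth-1 decision tree) on that subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=80, n_features=300,
                           n_informative=15, random_state=1)

# Pretend these column indices are the probes of one known pathway.
pathway_probes = np.arange(0, 25)
subset = X[:, pathway_probes]

clf = AdaBoostClassifier(n_estimators=50, random_state=1)

# Because each subset is small, scoring it is cheap; pathways whose
# cross-validated accuracy beats chance become candidates for
# hypothesis testing, and their genes can later be pooled.
acc = cross_val_score(clf, subset, y, cv=5).mean()
print(round(acc, 3))
```

Repeating this loop over every pathway subset keeps each training problem small, which is the computational point of segmenting the dataset by prior knowledge.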

4 citations


Cited by

Journal ArticleDOI
07 Nov 2019-PLOS ONE
TL;DR: The authors' simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000, while nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size.
Abstract: Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
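The feature-selection leakage effect the paper quantifies is easy to reproduce on pure-noise data (a standard demonstration, not the paper's own simulation code; all dimensions here are invented). Selecting features on the pooled data before K-fold CV inflates accuracy even though the labels carry no signal, whereas refitting the selection inside each training fold does not:

```python
# Sketch: feature-selection leakage vs. a leakage-free pipeline
# on labels that are pure noise.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))   # 40 samples, 1000 noise features
y = rng.integers(0, 2, size=40)   # labels carry no real signal

# Biased: features are selected using all samples, including those
# that later serve as test folds.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
biased = cross_val_score(SVC(), X_leaky, y, cv=5).mean()

# Unbiased: the pipeline refits the selector inside each training
# fold, so test folds never influence feature choice.
pipe = make_pipeline(SelectKBest(f_classif, k=10), SVC())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(biased, honest)  # the leaky estimate comes out higher
```

The same pattern underlies the paper's recommendation of nested CV or a held-out test set: every data-dependent step, selection included, must stay inside the training side of each split.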

622 citations

Journal ArticleDOI
TL;DR: Key features of deep learning that may give this approach an edge over other machine learning methods are discussed, and a number of applications of deep learning in biomedical studies demonstrating proof of concept and practical utility are reviewed.
Abstract: Increases in throughput and installed base of biomedical research equipment led to a massive accumulation of -omics data known to be highly variable, high-dimensional, and sourced from multiple often incompatible data platforms. While this data may be useful for biomarker identification and drug discovery, the bulk of it remains underutilized. Deep neural networks (DNNs) are efficient algorithms based on the use of compositional layers of neurons, with advantages well matched to the challenges -omics data presents. While achieving state-of-the-art results and even surpassing human accuracy in many challenging tasks, the adoption of deep learning in biomedicine has been comparatively slow. Here, we discuss key features of deep learning that may give this approach an edge over other machine learning methods. We then consider limitations and review a number of applications of deep learning in biomedical studies demonstrating proof of concept and practical utility.

532 citations

Journal ArticleDOI
TL;DR: This work demonstrates a deep learning neural net trained on transcriptomic data to recognize pharmacological properties of multiple drugs across different biological systems and conditions and proposes using deep neural net confusion matrices for drug repositioning.
Abstract: Deep learning is rapidly advancing many areas of science and technology with multiple success stories in image, text, voice and video recognition, robotics, and autonomous driving. In this paper we demonstrate how deep neural networks (DNN) trained on large transcriptional response data sets can classify various drugs to therapeutic categories solely based on their transcriptional profiles. We used the perturbation samples of 678 drugs across A549, MCF-7, and PC-3 cell lines from the LINCS Project and linked those to 12 therapeutic use categories derived from MeSH. To train the DNN, we utilized both gene-level transcriptomic data and transcriptomic data processed using a pathway activation scoring algorithm, for a pooled data set of samples perturbed with different concentrations of the drug for 6 and 24 hours. In both pathway and gene-level classification, the DNN achieved high classification accuracy and convincingly outperformed the support vector machine (SVM) model on every multiclass classification problem.
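The experimental setup — a multilayer network versus an SVM on the same multiclass profiles — can be mimicked in miniature. This is a stand-in, not the LINCS pipeline: the synthetic "transcriptional profiles", their dimensions, and the network size are all invented, and only the 12-category structure is taken from the abstract:

```python
# Minimal stand-in: a small neural network and an SVM classifying
# synthetic profiles into 12 "therapeutic categories".
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=100,
                           n_informative=30, n_classes=12,
                           n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two hidden layers as a toy analogue of the paper's DNN.
dnn = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)
svm = SVC().fit(X_tr, y_tr)

print(dnn.score(X_te, y_te), svm.score(X_te, y_te))
```

On this toy data neither model is guaranteed to win; the paper's reported DNN advantage is specific to the real LINCS transcriptomic data.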

401 citations

Journal ArticleDOI
TL;DR: There is no group of filter methods that always outperforms all others, but recommendations are made on filter methods that perform well across many of the data sets, and groups of filters that rank the features in similar orders are identified.

338 citations