Author

Zena M. Hira

Bio: Zena M. Hira is an academic researcher from Imperial College London. The author has contributed to research in topics: Feature extraction & Support vector machine. The author has an h-index of 3 and has co-authored 3 publications receiving 567 citations.

Papers
Journal ArticleDOI
TL;DR: Various ways of performing dimensionality reduction on high-dimensional microarray data are summarised to provide a clearer idea of when to use each one of them for saving computational time and resources.
Abstract: We summarise various ways of performing dimensionality reduction on high-dimensional microarray data. Many different feature selection and feature extraction methods exist and are widely used. All these methods aim to remove redundant and irrelevant features so that classification of new instances will be more accurate. A popular source of data is microarrays, a biological platform for gathering gene expressions. Analysing microarrays can be difficult due to the size of the data they provide. In addition, the complicated relations among the different genes make analysis more difficult, and removing excess features can improve the quality of the results. We present some of the most popular methods for selecting significant features and provide a comparison between them. Their advantages and disadvantages are outlined in order to provide a clearer idea of when to use each one of them for saving computational time and resources.
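The core distinction the survey draws can be illustrated in a few lines: feature selection keeps a subset of the original gene probes, while feature extraction builds new composite features. A minimal sketch on synthetic "microarray-like" data (the dataset and dimensions are invented for illustration, not taken from the paper):

```python
# Sketch: feature selection vs. feature extraction on synthetic
# high-dimensional, low-sample data (the regime typical of microarrays).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# 60 samples x 500 "gene probes": many features, few samples.
X, y = make_classification(n_samples=60, n_features=500,
                           n_informative=20, random_state=0)

# Feature selection: rank probes with a univariate ANOVA F-test
# and keep the top 20 original probes.
selected = SelectKBest(f_classif, k=20).fit_transform(X, y)

# Feature extraction: project onto 20 principal components,
# each a linear combination of all probes.
extracted = PCA(n_components=20).fit_transform(X)

print(selected.shape, extracted.shape)  # both (60, 20)
```

Both reduce the data to 20 columns, but selected features remain interpretable as individual genes, while extracted components mix information from every probe — one of the trade-offs the survey compares.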

749 citations

Journal ArticleDOI
03 Mar 2014-PLOS ONE
TL;DR: This work has proposed a priori manifold learning for finding a manifold in which a representative set of microarray data is fused with relevant data taken from the KEGG pathway database and found that using this new manifold method gives better classification results than using either PCA or conventional Isomap.
Abstract: Microarray databases are a large source of genetic data, which, upon proper analysis, could enhance our understanding of biology and medicine. Many microarray experiments have been designed to investigate the genetic mechanisms of cancer, and analytical approaches have been applied in order to classify different types of cancer or distinguish between cancerous and non-cancerous tissue. However, microarrays are high-dimensional datasets with high levels of noise, and this causes problems when using machine learning methods. A popular approach to this problem is to search for a set of features that will simplify the structure and to some degree remove the noise from the data. The most widely used approach to feature extraction is principal component analysis (PCA), which assumes a multivariate Gaussian model of the data. More recently, non-linear methods have been investigated. Among these, manifold learning algorithms, for example Isomap, aim to project the data from a higher-dimensional space onto a lower-dimensional one. We have proposed a priori manifold learning for finding a manifold in which a representative set of microarray data is fused with relevant data taken from the KEGG pathway database. Once the manifold has been constructed, the raw microarray data is projected onto it and clustering and classification can take place. In contrast to earlier fusion-based methods, the prior knowledge from the KEGG databases is not used in, and does not bias, the classification process; it merely acts as an aid to find the best space in which to search the data. In our experiments we have found that using our new manifold method gives better classification results than using either PCA or conventional Isomap.
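For context, the conventional Isomap baseline the authors compare against can be sketched as follows. This is not the paper's KEGG-fused variant — their contribution biases the embedding step with pathway knowledge — just the standard embed-then-classify workflow, on invented synthetic data:

```python
# Sketch of the conventional Isomap baseline (not the authors' method):
# learn a low-dimensional manifold, project the data onto it, classify.
from sklearn.datasets import make_classification
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=200,
                           n_informative=10, random_state=0)

# Isomap: build a k-nearest-neighbour graph, approximate geodesic
# distances along it, then embed into 5 dimensions.
embedding = Isomap(n_neighbors=10, n_components=5).fit_transform(X)

# Classification happens in the embedded space; in the paper, prior
# knowledge shapes the embedding above, not this classifier.
scores = cross_val_score(KNeighborsClassifier(3), embedding, y, cv=5)
print(embedding.shape, scores.mean())
```

The a priori variant would alter the neighbourhood graph using KEGG pathway relations before embedding, leaving the downstream classifier untouched.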

8 citations

Journal ArticleDOI
TL;DR: A way of utilizing prior knowledge to segment microarray datasets in such a way that machine learning can be used to identify candidate sets of genes for hypothesis testing is investigated.
Abstract: In order to provide the most effective therapy for cancer, it is important to be able to diagnose whether a patient's cancer will respond to a proposed treatment. Methylation profiling could contain information from which such predictions could be made. Currently, hypothesis testing is used to determine whether possible biomarkers for cancer progression produce statistically significant results. However, this approach requires the identification of individual genes, or sets of genes, as candidate hypotheses, and with the increasing size of modern microarrays, this task is becoming progressively harder. Exhaustive testing of small sets of genes is computationally infeasible, and so hypothesis generation depends either on the use of established biological knowledge or on heuristic methods. As an alternative, machine learning methods can be used to identify groups of genes that are acting together within sets of cancer data and associate their behaviors with cancer progression. These methods have the advantage of being multivariate and unbiased but unfortunately also rapidly become computationally infeasible as the number of gene probes and datasets increases. To address this problem, we have investigated a way of utilizing prior knowledge to segment microarray datasets in such a way that machine learning can be used to identify candidate sets of genes for hypothesis testing. A methylation dataset is divided into subsets, where each subset contains only the probes that relate to a known gene pathway. Each of these pathway subsets is used independently for classification. The classification method is AdaBoost with decision trees as weak classifiers. Since each pathway subset contains a relatively small number of gene probes, it is possible to train and test its classification accuracy quickly and determine whether it has valuable diagnostic information. Finally, genes from successful pathway subsets can be combined to create a classifier of high accuracy.
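The per-pathway classification step described above can be sketched roughly as below. The data and the pathway's probe indices are invented for illustration; the paper uses real methylation probes grouped by known gene pathways:

```python
# Hedged sketch of one pathway-subset classification: slice out the
# probes belonging to a single pathway and score AdaBoost (whose
# default weak learner is a depth-1 decision tree) on that subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=80, n_features=300,
                           n_informative=15, random_state=1)

# Pretend these column indices are the probes of one known pathway.
pathway_probes = np.arange(0, 25)
subset = X[:, pathway_probes]

clf = AdaBoostClassifier(n_estimators=50, random_state=1)

# Because each subset is small, scoring it is cheap; pathways whose
# cross-validated accuracy beats chance become candidates for
# hypothesis testing, and their genes can later be pooled.
acc = cross_val_score(clf, subset, y, cv=5).mean()
print(round(acc, 3))
```

Repeating this loop over every pathway subset keeps each training problem small, which is the computational point of segmenting the dataset by prior knowledge.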

4 citations


Cited by

Journal ArticleDOI
07 Nov 2019-PLOS ONE
TL;DR: The authors' simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000, while nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size.
Abstract: Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
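The feature-selection leakage effect the paper quantifies is easy to reproduce on pure-noise data (a standard demonstration, not the paper's own simulation code; all dimensions here are invented). Selecting features on the pooled data before K-fold CV inflates accuracy even though the labels carry no signal, whereas refitting the selection inside each training fold does not:

```python
# Sketch: feature-selection leakage vs. a leakage-free pipeline
# on labels that are pure noise.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))   # 40 samples, 1000 noise features
y = rng.integers(0, 2, size=40)   # labels carry no real signal

# Biased: features are selected using all samples, including those
# that later serve as test folds.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
biased = cross_val_score(SVC(), X_leaky, y, cv=5).mean()

# Unbiased: the pipeline refits the selector inside each training
# fold, so test folds never influence feature choice.
pipe = make_pipeline(SelectKBest(f_classif, k=10), SVC())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(biased, honest)  # the leaky estimate comes out higher
```

The same pattern underlies the paper's recommendation of nested CV or a held-out test set: every data-dependent step, selection included, must stay inside the training side of each split.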

622 citations

Journal ArticleDOI
TL;DR: Key features of deep learning that may give this approach an edge over other machine learning methods are discussed, and a number of applications of deep learning in biomedical studies demonstrating proof of concept and practical utility are reviewed.
Abstract: Increases in throughput and installed base of biomedical research equipment led to a massive accumulation of -omics data known to be highly variable, high-dimensional, and sourced from multiple often incompatible data platforms. While this data may be useful for biomarker identification and drug discovery, the bulk of it remains underutilized. Deep neural networks (DNNs) are efficient algorithms based on the use of compositional layers of neurons, with advantages well matched to the challenges -omics data presents. While achieving state-of-the-art results and even surpassing human accuracy in many challenging tasks, the adoption of deep learning in biomedicine has been comparatively slow. Here, we discuss key features of deep learning that may give this approach an edge over other machine learning methods. We then consider limitations and review a number of applications of deep learning in biomedical studies demonstrating proof of concept and practical utility.

532 citations

Journal ArticleDOI
TL;DR: This work demonstrates a deep learning neural net trained on transcriptomic data to recognize pharmacological properties of multiple drugs across different biological systems and conditions and proposes using deep neural net confusion matrices for drug repositioning.
Abstract: Deep learning is rapidly advancing many areas of science and technology with multiple success stories in image, text, voice and video recognition, robotics, and autonomous driving. In this paper we demonstrate how deep neural networks (DNN) trained on large transcriptional response data sets can classify various drugs to therapeutic categories solely based on their transcriptional profiles. We used the perturbation samples of 678 drugs across A549, MCF-7, and PC-3 cell lines from the LINCS Project and linked those to 12 therapeutic use categories derived from MeSH. To train the DNN, we utilized both gene-level transcriptomic data and transcriptomic data processed using a pathway activation scoring algorithm, for a pooled data set of samples perturbed with different concentrations of the drug for 6 and 24 hours. In both pathway and gene-level classification, the DNN achieved high classification accuracy and convincingly outperformed the support vector machine (SVM) model on every multiclass classification problem.
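The experimental setup — a multilayer network versus an SVM on the same multiclass profiles — can be mimicked in miniature. This is a stand-in, not the LINCS pipeline: the synthetic "transcriptional profiles", their dimensions, and the network size are all invented, and only the 12-category structure is taken from the abstract:

```python
# Minimal stand-in: a small neural network and an SVM classifying
# synthetic profiles into 12 "therapeutic categories".
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=100,
                           n_informative=30, n_classes=12,
                           n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two hidden layers as a toy analogue of the paper's DNN.
dnn = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)
svm = SVC().fit(X_tr, y_tr)

print(dnn.score(X_te, y_te), svm.score(X_te, y_te))
```

On this toy data neither model is guaranteed to win; the paper's reported DNN advantage is specific to the real LINCS transcriptomic data.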

401 citations

Journal ArticleDOI
TL;DR: There is no group of filter methods that always outperforms all others, but recommendations are made on filter methods that perform well across many of the data sets, and groups of filters that rank the features in similar orders are identified.

338 citations