Author

Zhiyuan Luo

Bio: Zhiyuan Luo is an academic researcher from Royal Holloway, University of London. The author has contributed to research on topics including support vector machines and computer science. The author has an h-index of 19 and has co-authored 105 publications receiving 1,354 citations. Previous affiliations of Zhiyuan Luo include University of London and Heriot-Watt University.


Papers
Journal ArticleDOI
TL;DR: Investigation of the influence of different preanalytical handling methods on surface-enhanced laser desorption/ionization time-of-flight mass spectrometry protein profiles of prefractionated serum found that changes in preanalytical handling variables affect profiles of serum proteins, including proposed disease biomarkers.
Abstract:
Background: High-throughput proteomic methods for disease biomarker discovery in human serum are promising, but concerns exist regarding reproducibility of results and variability introduced by sample handling. This study investigated the influence of different preanalytic handling methods on surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS) protein profiles of prefractionated serum. We investigated whether older collections with longer sample transit times yield useful protein profiles, and sought to establish the most feasible collection methods for future clinical proteomic studies.
Methods: To examine the effect of tube type, clotting time, transport/incubation time, temperature, and storage method on protein profiles, we used 6 different handling methods to collect sera from 25 healthy volunteers. We used a high-throughput prefractionation strategy to generate anion-exchange fractions and examined their protein profiles on CM10, IMAC30-Cu, and H50 arrays by SELDI-TOF MS.
Results: Prolonged transport and incubation at room temperature generated low mass peaks, resulting in distinctions among the protocols. The most and least stringent methods gave the lowest overall peak variances, indicating that proteolysis in the latter may have been nearly complete. For samples transported on ice there was little effect of clotting time, storage method, or transit time. Certain proteins (TTR, ApoCI, and transferrin) were unaffected by handling, but others (ITIH4 and hemoglobin β) displayed significant variability.
Conclusions: Changes in preanalytical handling variables affect profiles of serum proteins, including proposed disease biomarkers. Proteomic analysis of samples from serum banks collected using less stringent protocols is applicable if all samples are handled identically.

149 citations

Journal ArticleDOI
TL;DR: The prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs not only comes from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs.
Abstract: The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Despite the extensive attention on traditional stability under data perturbations or parameter variations, few studies include influences coming from the intrinsic randomness in generating VIMs, i.e., bagging, randomization and permutation. To address these influences, in this paper we introduce a new concept of intrinsic stability of VIMs, which is defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations and parameter variations. Two widely used VIMs, i.e., Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests we comprehensively investigate how different factors influence the intrinsic stability. The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional and small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (size of sample) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both negative monotonic correlation and negative linear correlation with the intrinsic stability, while OOB accuracy has monotonic correlations with intrinsic stability. This indicates that high-dimensional, small-sample and high-complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to parameter settings of random forest, a large number of trees is preferred. No significant correlations can be seen between intrinsic stability and other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability. First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs not only comes from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability, and may help reduce the instability of VIMs. Second, by investigating the potential factors of intrinsic stability, users would be more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high-complexity datasets.
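
The repeated-runs experiment the abstract describes is straightforward to reproduce. Below is a minimal sketch of how intrinsic stability can be measured, assuming scikit-learn's RandomForestClassifier as a stand-in (its impurity-based feature_importances_ corresponds to MDG); the synthetic dataset, number of runs, and the mean-pairwise-Spearman summary are illustrative choices, not the paper's exact protocol.

```python
# Sketch: intrinsic stability of a variable importance measure (VIM).
# Same data, same parameters; only the forest's internal randomness
# (bagging, feature sub-sampling) varies between runs.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=50, random_state=0)

def mdg_importances(seed):
    rf = RandomForestClassifier(n_estimators=500, random_state=seed).fit(X, y)
    return rf.feature_importances_  # impurity-based, i.e. MDG-style

runs = [mdg_importances(seed) for seed in range(10)]

# Intrinsic stability as the mean pairwise Spearman correlation between the
# feature rankings produced by repeated runs without data perturbation.
corrs = [spearmanr(runs[i], runs[j])[0]
         for i in range(len(runs)) for j in range(i + 1, len(runs))]
print(f"intrinsic stability (mean pairwise Spearman): {np.mean(corrs):.3f}")
```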

93 citations

Book ChapterDOI
01 Apr 2007
TL;DR: Leave-one-out cross validation (LOOCV) classification results on two datasets, Breast Cancer and ALL/AML leukemia, demonstrate that the proposed method can achieve a 100% success rate with the final reduced subset.
Abstract: Gene selection is an important problem in microarray data processing. A new gene selection method based on the Wilcoxon rank sum test and the Support Vector Machine (SVM) is proposed in this paper. First, the Wilcoxon rank sum test is used to select a preliminary subset. Then an SVM classifier with a linear kernel is trained and tested on each selected gene separately, and genes with high testing accuracy rates are chosen to form the final reduced gene subset. Leave-one-out cross validation (LOOCV) classification results on two datasets, Breast Cancer and ALL/AML leukemia, demonstrate that the proposed method can achieve a 100% success rate with the final reduced subset. The selected genes are listed and their expression levels are sketched, showing that the selected genes give a clear separation between the two classes.
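
A minimal sketch of the two-stage procedure described above, assuming scipy's Wilcoxon rank-sum test and scikit-learn's SVC; the synthetic data, p-value cutoff, and accuracy threshold are illustrative assumptions, not the paper's settings.

```python
# Stage 1: Wilcoxon rank sum filter; Stage 2: per-gene linear SVM with LOOCV.
import numpy as np
from scipy.stats import ranksums
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))       # 40 samples x 200 genes (synthetic)
y = np.array([0] * 20 + [1] * 20)    # two classes
X[y == 1, :5] += 2.0                 # make the first five genes informative

# Stage 1: rank sum test per gene; keep genes whose class distributions differ.
pvals = np.array([ranksums(X[y == 0, g], X[y == 1, g]).pvalue
                  for g in range(X.shape[1])])
candidates = np.where(pvals < 0.01)[0]

# Stage 2: evaluate a linear-kernel SVM on each candidate gene alone via
# leave-one-out cross validation; keep the genes with high accuracy.
loo = LeaveOneOut()
accuracy = {g: cross_val_score(SVC(kernel="linear"), X[:, [g]], y, cv=loo).mean()
            for g in candidates}
selected = sorted(g for g, a in accuracy.items() if a >= 0.95)
print("final reduced gene subset:", selected)
```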

65 citations

Journal ArticleDOI
TL;DR: In this article, Partial Least Squares (PLS) regression models were built for six coal properties (total moisture (Mt), inherent moisture (Minh), ash (Ash), volatile matter (VM), fixed carbon (FC), and sulfur (S)) from the NIRS of 199 samples.
Abstract: Using near infrared reflectance spectra (NIRS) for rapid coal property analysis is convenient, fast, and safe, and could be used as an online analysis method. This study first built Partial Least Squares (PLS) regression models for six coal properties (total moisture (Mt), inherent moisture (Minh), ash (Ash), volatile matter (VM), fixed carbon (FC), and sulfur (S)) from the NIRS of 199 samples. The 199 samples came from different mines and included 4 types of coal (fat coal, coking coal, lean coal, and meager lean coal). For comparison, models for the six properties were also built separately for each type. Results show that the type-specific models are more effective than a single model for the entire sample set. A new method for coal classification was then obtained by applying Principal Component Analysis (PCA) and the Support Vector Machine (SVM) to the spectra of the coal samples, which achieved high classification accuracy while saving time. Finally, separate PLS regression models were built for the types assigned by the new classification method, and these gave better prediction results than the full-sample models. Thus, predictive ability was improved by routing coal samples to the corresponding models using the SVM classification.
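
A minimal sketch of the pipeline described above, assuming scikit-learn's PCA, SVC, and PLSRegression; the synthetic spectra and component counts stand in for the 199 NIRS samples and are not the paper's actual settings.

```python
# Sketch: classify coal type from NIR spectra with PCA + SVM, then fit a
# separate PLS regression model per type instead of one model for all samples.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(1)
spectra = rng.normal(size=(199, 700))      # NIR reflectance spectra (synthetic)
coal_type = rng.integers(0, 4, size=199)   # fat / coking / lean / meager lean
ash = rng.normal(size=199)                 # one target property, e.g. Ash

# Classification step: compress the spectra with PCA, classify with an SVM.
clf = make_pipeline(PCA(n_components=10), SVC()).fit(spectra, coal_type)

# Regression step: one PLS model per predicted coal type.
models = {}
for t in range(4):
    mask = clf.predict(spectra) == t
    if mask.sum() > 10:                    # need enough samples per type
        models[t] = PLSRegression(n_components=8).fit(spectra[mask], ash[mask])

# New samples are routed to the model for their predicted type.
new = rng.normal(size=(1, 700))
t = clf.predict(new)[0]
if t in models:
    print("predicted ash:", float(models[t].predict(new).ravel()[0]))
```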

53 citations

Journal ArticleDOI
TL;DR: The transductive confidence machine is applied to the problems of acute leukaemia and ovarian cancer prediction using microarray and proteomics pattern diagnostics, demonstrating that the algorithm performs well, yielding well-calibrated and informative predictions whilst maintaining a high level of accuracy.
Abstract: We focus on the problem of prediction with confidence and describe a recently developed learning algorithm called the transductive confidence machine for making qualified region predictions. Its main advantage, in comparison with other classifiers, is that it is well-calibrated, with the number of prediction errors strictly controlled by a given predefined confidence level. We apply the transductive confidence machine to the problems of acute leukaemia and ovarian cancer prediction using microarray and proteomics pattern diagnostics, respectively. We demonstrate that the algorithm performs well, yielding well-calibrated and informative predictions whilst maintaining a high level of accuracy.
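
A minimal sketch of a transductive confidence machine, assuming a simple 1-nearest-neighbour nonconformity measure (distance to the nearest same-label example divided by distance to the nearest other-label example); this is an illustrative toy, not the paper's exact construction.

```python
# Sketch: transductive confidence machine producing region predictions.
import numpy as np

def nonconformity(X, y, i):
    # How strange example i looks for its assigned label.
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf  # exclude the example itself
    return d[y == y[i]].min() / d[y != y[i]].min()

def region_prediction(X_train, y_train, x_test, epsilon=0.05):
    """Return every label not rejected at significance level epsilon."""
    region = set()
    for label in np.unique(y_train):
        # Transduction: tentatively include the test object with this label.
        X = np.vstack([X_train, x_test])
        y = np.append(y_train, label)
        scores = np.array([nonconformity(X, y, i) for i in range(len(y))])
        # p-value: fraction of examples at least as nonconforming as the test one.
        p = (scores >= scores[-1]).mean()
        if p > epsilon:
            region.add(label)
    return region
```

At confidence level 1 − ε, calibration means the true label falls outside the predicted region with frequency at most ε; predictions are informative when the region contains a single label.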

44 citations


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories.

First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules.

Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, handwriting recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs.

Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules.

Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up to date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically.

Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Christopher M. Bishop
01 Jan 2006
TL;DR: A textbook covering probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, mixture models and EM, approximate inference, sampling methods, and combining models.
Abstract: Chapter contents: Probability Distributions; Linear Models for Regression; Linear Models for Classification; Neural Networks; Kernel Methods; Sparse Kernel Machines; Graphical Models; Mixture Models and EM; Approximate Inference; Sampling Methods; Continuous Latent Variables; Sequential Data; Combining Models.

10,141 citations

Journal ArticleDOI
31 Mar 2020 - Science
TL;DR: A mathematical model of infectiousness was developed to estimate the basic reproductive number R0, to quantify the contribution of different transmission routes, and to determine the requirements for successful contact tracing; the combination of two key parameters needed to reduce R0 to less than 1 was determined.
Abstract: The newly emergent human virus SARS-CoV-2 (severe acute respiratory syndrome-coronavirus 2) is resulting in high fatality rates and incapacitated health systems. Preventing further transmission is a priority. We analyzed key parameters of epidemic spread to estimate the contribution of different transmission routes and determine requirements for case isolation and contact tracing needed to stop the epidemic. Although SARS-CoV-2 is spreading too fast to be contained by manual contact tracing, it could be controlled if this process were faster, more efficient, and happened at scale. A contact-tracing app that builds a memory of proximity contacts and immediately notifies contacts of positive cases can achieve epidemic control if used by enough people. By targeting recommendations to only those at risk, epidemics could be contained without resorting to mass quarantines ("lockdowns") that are harmful to society. We discuss the ethical requirements for an intervention of this kind.
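
The paper's central quantitative point is that case isolation and contact tracing must jointly be fast and efficient enough to push the reproduction number below 1. The toy calculation below illustrates that kind of threshold analysis with a crude multiplicative model; the paper's actual model integrates over infectiousness profiles and transmission routes, so the formula and parameter values here are illustrative assumptions only.

```python
# Toy threshold analysis: which combinations of isolation and tracing
# success rates bring the effective reproduction number below 1?
import numpy as np

R0 = 2.0  # assumed basic reproduction number (illustrative)

def effective_R(eps_isolation, eps_tracing, r0=R0):
    # Crude assumption: isolating a fraction eps_isolation of cases removes
    # their onward transmission, and tracing removes a further fraction
    # eps_tracing of the transmission that remains.
    return r0 * (1 - eps_isolation) * (1 - eps_tracing)

for ei in np.linspace(0, 1, 11):
    for et in np.linspace(0, 1, 11):
        if effective_R(ei, et) < 1:
            # Report the smallest tracing success that suffices at this
            # isolation level, then move on.
            print(f"isolation {ei:.1f} + tracing {et:.1f} -> controlled")
            break
```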

2,340 citations

Journal ArticleDOI
TL;DR: The occurrence of clonal T cell acute lymphoblastic leukemia (T-ALL) promoted by insertional mutagenesis in a completed gene therapy trial of 10 SCID-X1 patients is described, and a general toxicity of endogenous gammaretroviral enhancer elements is highlighted.
Abstract: X-linked SCID (SCID-X1) is amenable to correction by gene therapy using conventional gammaretroviral vectors. Here, we describe the occurrence of clonal T cell acute lymphoblastic leukemia (T-ALL) promoted by insertional mutagenesis in a completed gene therapy trial of 10 SCID-X1 patients. Integration of the vector in an antisense orientation 35 kb upstream of the protooncogene LIM domain only 2 (LMO2) caused overexpression of LMO2 in the leukemic clone. However, leukemogenesis was likely precipitated by the acquisition of other genetic abnormalities unrelated to vector insertion, including a gain-of-function mutation in NOTCH1, deletion of the tumor suppressor gene locus cyclin-dependent kinase 2A (CDKN2A), and translocation of the TCR-β region to the STIL-TAL1 locus. These findings highlight a general toxicity of endogenous gammaretroviral enhancer elements and also identify a combinatorial process during leukemic evolution that will be important for risk stratification and for future protocol design.

1,162 citations

Journal ArticleDOI
TL;DR: A genome-wide study to improve prognostic classification of ALL in children revealed a new ALL subtype, the underlying genetic abnormalities of which were characterised by comparative genomic hybridisation arrays and molecular cytogenetics.
Abstract:
Background: Genetic subtypes of acute lymphoblastic leukaemia (ALL) are used to determine risk and treatment in children. 25% of precursor B-ALL cases are genetically unclassified and have intermediate prognosis. We aimed to use a genome-wide study to improve prognostic classification of ALL in children.
Methods: We constructed a classifier based on gene expression in 190 children with newly diagnosed ALL (German Cooperative ALL [COALL] discovery cohort) by use of double-loop cross-validation and validated this in an independent cohort of 107 newly diagnosed patients (Dutch Childhood Oncology Group [DCOG] independent validation cohort). Hierarchical cluster analysis with classifying gene-probe sets revealed a new ALL subtype, the underlying genetic abnormalities of which were characterised by comparative genomic hybridisation arrays and molecular cytogenetics.
Findings: Our classifier predicted ALL subtype with a median accuracy of 90.0% (IQR 88.3–91.7) in the discovery cohort and correctly identified 94 of 107 patients (accuracy 87.9%) in the independent validation cohort. Without our classifier, 44 children in the COALL cohort and 33 children in the DCOG cohort would have been classified as B-other. However, hierarchical clustering showed that many of these genetically unclassified cases clustered with BCR–ABL1-positive cases: 30 (19%) of 154 children with precursor B-ALL in the COALL cohort and 14 (15%) of 92 children with precursor B-ALL in the DCOG cohort had this BCR–ABL1-like disease. In the COALL cohort, these patients had unfavourable outcome (5-year disease-free survival 59.5%, 95% CI 37.1–81.9) compared with patients with other precursor B-ALL (84.4%, 76.8–92.1%; p=0.012), a prognosis similar to that of patients with BCR–ABL1-positive ALL (51.9%, 23.1–80.6%). In the DCOG cohort, the prognosis of BCR–ABL1-like disease (57.1%, 31.2–83.1%) was worse than that of other precursor B-ALL (79.2%, 70.2–88.3%; p=0.026), and similar to that of BCR–ABL1-positive ALL (32.5%, 2.3–62.7%). 36 (82%) of the patients with BCR–ABL1-like disease had deletions in genes involved in B-cell development, including IKZF1, TCF3, EBF1, PAX5, and VPREB1; only nine (36%) of 25 patients with B-other ALL had deletions in these genes (p=0.0002). Compared with other precursor B-ALL cells, BCR–ABL1-like cells were 73 times more resistant to L-asparaginase (p=0.001) and 1.6 times more resistant to daunorubicin (p=0.017), but toxicity of prednisolone and vincristine did not differ.
Interpretation: New treatment strategies are needed to improve outcome for this newly identified high-risk subtype of ALL.
Funding: Dutch Cancer Society, Sophia Foundation for Medical Research, Paediatric Oncology Foundation Rotterdam, Centre of Medical Systems Biology of the Netherlands Genomics Initiative/Netherlands Organisation for Scientific Research, American National Institute of Health, American National Cancer Institute, and American Lebanese Syrian Associated Charities.
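
The "double-loop cross-validation" used to build the classifier is what is now commonly called nested cross-validation. Below is a minimal sketch of the idea with scikit-learn, assuming a linear SVM as the base classifier and synthetic data standing in for the gene expression cohorts; the paper's actual classifier and tuning grid are not specified here.

```python
# Sketch: double-loop (nested) cross-validation. The inner loop tunes
# hyperparameters; the outer loop estimates accuracy on data never used
# for tuning, avoiding optimistic bias.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: 190 "patients", 1000 "gene-probe sets".
X, y = make_classification(n_samples=190, n_features=1000, n_informative=30,
                           random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: choose C for a linear SVM within each outer training fold.
tuned = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]}, cv=inner)

# Outer loop: evaluate the entire tuning procedure on held-out folds.
scores = cross_val_score(tuned, X, y, cv=outer)
print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```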

800 citations