Journal Article

Prediction and Validation of Disease Genes Using HeteSim Scores

TL;DR: A novel relevance measure, called HeteSim, is used to prioritize candidate disease genes, and HSSVM is found to avoid the disadvantage of existing machine learning based methods, which always predict similar genes for different diseases.
Abstract: Deciphering gene-disease associations is an important goal in biomedical research. In this paper, we use a novel relevance measure, called HeteSim, to prioritize candidate disease genes. Two methods are presented, both based on heterogeneous networks constructed from protein-protein interactions, gene-phenotype associations, and phenotype-phenotype similarity. In HeteSim_MultiPath (HSMP), HeteSim scores of different paths are combined with a constant that dampens the contributions of longer paths. In HeteSim_SVM (HSSVM), HeteSim scores are combined with a machine learning method. The 3-fold experiments show that our non-machine learning method HSMP performs better than the existing non-machine learning methods, and that our machine learning method HSSVM obtains accuracy similar to that of the best existing machine learning method, CATAPULT. From the analysis of the top 10 predicted genes for different diseases, we found that HSSVM avoids the disadvantage of the existing machine learning based methods, which always predict similar genes for different diseases. The data sets and Matlab code for the two methods are freely available for download at http://lab.malab.cn/data/HeteSim/index.jsp .
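The abstract says that HSMP combines HeteSim scores of different paths using a constant that dampens longer paths, but it does not give the exact formula. The following is only a minimal sketch of one plausible form, assuming geometric damping (`beta ** length`); the function name `combine_path_scores`, the parameter `beta`, and the input layout are hypothetical and are not taken from the authors' Matlab code.

```python
def combine_path_scores(path_scores, beta=0.5):
    """Combine per-path relevance scores into one gene-disease score.

    path_scores: iterable of (path_length, score) pairs, where score is
                 a precomputed relevance score for one path (assumed input).
    beta:        damping constant in (0, 1); an assumed geometric form,
                 so a path of length L is weighted by beta ** L.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    total = 0.0
    for length, score in path_scores:
        if length < 1:
            raise ValueError("path length must be at least 1")
        # Longer paths receive exponentially smaller weights.
        total += (beta ** length) * score
    return total


if __name__ == "__main__":
    # Two hypothetical paths: a short one and a longer one.
    print(combine_path_scores([(1, 0.8), (2, 0.4)], beta=0.5))
```

With this weighting, a path of length 3 contributes a quarter as much as a path of length 1 with the same score when `beta` is 0.5, which matches the stated intent of dampening longer paths.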
Citations
Journal Article
TL;DR: Twenty state-of-the-art computational models of predicting miRNA-disease associations from different perspectives are reviewed, including five feasible and important research schemas, and future directions for further development of computational models are summarized.
Abstract: Circular RNAs (circRNAs) are a class of single-stranded, covalently closed RNA molecules with a variety of biological functions. Studies have shown that circRNAs are involved in a variety of biological processes and play an important role in the development of various complex diseases, so the identification of circRNA-disease associations would contribute to the diagnosis and treatment of diseases. In this review, we summarize the discovery, classifications and functions of circRNAs and introduce four important diseases associated with circRNAs. Then, we list some significant and publicly accessible databases containing comprehensive annotation resources of circRNAs and experimentally validated circRNA-disease associations. Next, we introduce some state-of-the-art computational models for predicting novel circRNA-disease associations and divide them into two categories, namely network algorithm-based and machine learning-based models. Subsequently, several evaluation methods of prediction performance of these computational models are summarized. Finally, we analyze the advantages and disadvantages of different types of computational models and provide some suggestions to promote the development of circRNA-disease association identification from the perspective of the construction of new computational models and the accumulation of circRNA-related data.

473 citations

Journal Article
Ruolan Chen, Xiangrong Liu, Shuting Jin, Jiawei Lin, Juan Liu
TL;DR: A hierarchical classification scheme is adopted and several representative methods of each category of drug-target interaction prediction are introduced, especially the recent state-of-the-art methods.
Abstract: Identifying drug-target interactions will greatly narrow down the scope of search of candidate medications, and thus can serve as the vital first step in drug discovery. Considering that in vitro experiments are extremely costly and time-consuming, high efficiency computational prediction methods could serve as promising strategies for drug-target interaction (DTI) prediction. In this review, our goal is to focus on machine learning approaches and provide a comprehensive overview. First, we summarize a brief list of databases frequently used in drug discovery. Next, we adopt a hierarchical classification scheme and introduce several representative methods of each category, especially the recent state-of-the-art methods. In addition, we compare the advantages and limitations of methods in each category. Lastly, we discuss the remaining challenges and future outlook of machine learning in DTI prediction. This article may provide a reference and tutorial insights on machine learning-based DTI prediction for future researchers.

162 citations


Cites methods from "Prediction and Validation of Diseas..."

  • ...Computational methods have achieved favorable performance in many related bioinformatics fields, such as disease-related miRNA prediction [7–9], disease genes prediction [10], protein-protein interaction prediction [11] and protein subcellular location prediction [12]....

    [...]

Journal Article
TL;DR: An algorithm for predicting gene expression values based on XGBoost, which integrates multiple tree models and has stronger interpretability and outperforms existing models and will be a significant contribution to the toolbox for gene expression value prediction.
Abstract: Gene expression profiling has been widely used to characterize cell status to reflect the health of the body, to diagnose genetic diseases, etc. In recent years, although the cost of genome-wide expression profiling is gradually decreasing, the cost of collecting expression profiles for thousands of genes is still very high. Considering gene expressions are usually highly correlated in humans, the expression values of the remaining target genes can be predicted by analyzing the values of 943 landmark genes. Hence, we designed an algorithm for predicting gene expression values based on XGBoost, which integrates multiple tree models and has stronger interpretability. We tested the performance of XGBoost model on the GEO dataset and RNA-seq dataset and compared the result with other existing models. Experiments showed that the XGBoost model achieved a significantly lower overall error than the existing D-GEX algorithm, linear regression, and KNN methods. In conclusion, the XGBoost algorithm outperforms existing models and will be a significant contribution to the toolbox for gene expression value prediction.

127 citations


Cites background from "Prediction and Validation of Diseas..."

  • ...Gene expression profiling is a vital biological tool commonly used to capture the response of cells to disease or drug treatments (Celis et al., 2000; Mclachlan et al., 2005; Wang et al., 2006; Mallick et al., 2009; Zeng et al., 2016)....

    [...]

Journal Article
TL;DR: The benchmark dataset, feature extraction, machine learning method and published results were summarized and the perspective of machine learning methods in protein sub-Golgi apparatus localization prediction was pointed out.
Abstract: The location of proteins in a cell can provide important clues to their functions in various biological processes. Thus, the application of machine learning methods to the prediction of protein subcellular localization has become a hotspot in bioinformatics. As one of the key organelles, the Golgi apparatus is in charge of protein storage, packaging, and distribution. The identification of protein location in the Golgi apparatus will provide in-depth insights into their functions. Thus, machine learning-based methods of predicting protein location in the Golgi apparatus have been extensively explored. The development of protein sub-Golgi apparatus localization prediction should be reviewed to provide a complete background for the field. The benchmark dataset, feature extraction, machine learning method and published results were summarized. We briefly introduced the recent progress in protein sub-Golgi apparatus localization prediction using machine learning methods and discussed their advantages and disadvantages. We pointed out the perspective of machine learning methods in protein sub-Golgi localization prediction.

113 citations

Journal Article
TL;DR: A stacked ensemble model PredT4SE-Stack was developed to predict T4SEs, which utilized an ensemble of base-classifiers implemented by various machine learning algorithms, such as support vector machine, gradient boosting machine, and extremely randomized trees, to generate outputs for the meta-classifier in the classification system.
Abstract: Gram-negative bacteria use various secretion systems to deliver their secreted effectors. Among them, type IV secretion system exists widely in a variety of bacterial species, and secretes type IV secreted effectors (T4SEs), which play vital roles in host-pathogen interactions. However, experimental approaches to identify T4SEs are time- and resource-consuming. In the present study, we aim to develop an in silico stacked ensemble method to predict whether a protein is an effector of type IV secretion system or not based on its sequence information. The protein sequences were encoded by the feature of position specific scoring matrix (PSSM)-composition by summing rows that correspond to the same amino acid residues in PSSM profiles. Based on the PSSM-composition features, we develop a stacked ensemble model PredT4SE-Stack to predict T4SEs, which utilized an ensemble of base-classifiers implemented by various machine learning algorithms, such as support vector machine, gradient boosting machine, and extremely randomized trees, to generate outputs for the meta-classifier in the classification system. Our results demonstrated that the framework of PredT4SE-Stack was a feasible and effective way to accurately identify T4SEs based on protein sequence information. The datasets and source code of PredT4SE-Stack are freely available at http://xbioinfo.sjtu.edu.cn/PredT4SE_Stack/index.php.
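The PSSM-composition encoding described above (summing the PSSM rows that correspond to the same amino acid residue) can be sketched as follows. This is an illustrative reading of the abstract, not the PredT4SE-Stack source code: the function name `pssm_composition`, the alphabetical column order in `AMINO_ACIDS`, and the choice to skip non-standard residues are all assumptions, and any normalization the authors may apply is omitted.

```python
# Assumed residue order for the 20 composition rows (alphabetical one-letter
# codes); real PSSM files may use a different column order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_composition(sequence, pssm):
    """Return a 400-dimensional feature vector from an L x 20 PSSM.

    sequence: protein sequence of length L (one-letter codes).
    pssm:     list of L rows, each a list of 20 scores.
    Rows belonging to the same residue type are summed, giving a
    20 x 20 matrix that is flattened row by row.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    comp = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        if len(row) != 20:
            raise ValueError("each PSSM row must have 20 columns")
        if residue not in index:
            continue  # assumption: ignore non-standard residues such as 'X'
        target = comp[index[residue]]
        for col, value in enumerate(row):
            target[col] += value
    # Flatten the 20 x 20 matrix into a single feature vector.
    return [value for row in comp for value in row]


if __name__ == "__main__":
    toy_pssm = [[1.0] * 20, [2.0] * 20, [3.0] * 20]
    features = pssm_composition("AAC", toy_pssm)
    print(len(features), features[0], features[20])
```

The resulting vector has a fixed length regardless of sequence length, which is what allows it to feed standard classifiers such as the base-classifiers named in the abstract.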

97 citations

References
Journal Issue
TL;DR: Experiments on large coauthorship networks suggest that information about future interactions can be extracted from network topology alone, and that fairly subtle measures for detecting node proximity can outperform more direct measures.
Abstract: Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future? We formalize this question as the link-prediction problem, and we develop approaches to link prediction based on measures for analyzing the “proximity” of nodes in a network. Experiments on large coauthorship networks suggest that information about future interactions can be extracted from network topology alone, and that fairly subtle measures for detecting node proximity can outperform more direct measures. © 2007 Wiley Periodicals, Inc.

4,181 citations


"Prediction and Validation of Diseas..." refers methods in this paper

  • ...[14] introduce the Katz method, which has been successfully applied for link prediction in social networks [15], into the disease genes prediction problem....

    [...]
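The Katz measure mentioned in the excerpt above scores a node pair by summing walks of every length between them, each weighted by a damping factor raised to the walk length. The sketch below computes a truncated version on a small adjacency matrix in pure Python; the function names, the default `beta`, and the truncation length `max_len` are illustrative assumptions rather than settings reported by either paper.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def truncated_katz(adj, beta=0.1, max_len=3):
    """Truncated Katz scores: sum over l = 1..max_len of beta**l * A**l.

    adj:     square adjacency matrix (list of lists of 0/1 or weights).
    beta:    damping factor; smaller values penalize long walks more.
    max_len: longest walk length included (assumed truncation).
    """
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # holds A**l, starting at l = 1
    weight = beta                    # holds beta**l
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = mat_mul(power, adj)
        weight *= beta
    return scores


if __name__ == "__main__":
    # Path graph 0 - 1 - 2: nodes 0 and 2 are linked only through node 1.
    path_graph = [[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]]
    katz = truncated_katz(path_graph, beta=0.1, max_len=3)
    print(katz[0][1], katz[0][2])
```

In a gene-disease setting the adjacency matrix would cover a heterogeneous network of genes and phenotypes, and the score between a gene node and a disease node would serve as the ranking value.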

Journal Article
TL;DR: A number of new features in HPRD are added, including PhosphoMotif Finder, which allows users to find the presence of over 320 experimentally verified phosphorylation motifs in proteins of interest, and a protein distributed annotation system—Human Proteinpedia.
Abstract: Human Protein Reference Database (HPRD--http://www.hprd.org/), initially described in 2003, is a database of curated proteomic information pertaining to human proteins. We have recently added a number of new features in HPRD. These include PhosphoMotif Finder, which allows users to find the presence of over 320 experimentally verified phosphorylation motifs in proteins of interest. Another new feature is a protein distributed annotation system--Human Proteinpedia (http://www.humanproteinpedia.org/)--through which laboratories can submit their data, which is mapped onto protein entries in HPRD. Over 75 laboratories involved in proteomics research have already participated in this effort by submitting data for over 15,000 human proteins. The submitted data includes mass spectrometry and protein microarray-derived data, among other data types. Finally, HPRD is also linked to a compendium of human signaling pathways developed by our group, NetPath (http://www.netpath.org/), which currently contains annotations for several cancer and immune signaling pathways. Since the last update, more than 5500 new protein sequences have been added, making HPRD a comprehensive resource for studying the human proteome.

3,081 citations


"Prediction and Validation of Diseas..." refers methods in this paper

  • ...Two different networks HumanNet [18] and HPRD network [19] are used....

    [...]

Journal Article
TL;DR: Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative and timely knowledgebase of human genes and genetic disorders compiled to support research and education in human genomics and the practice of clinical genetics.
Abstract: Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative and timely knowledgebase of human genes and genetic disorders compiled to support human genetics research and education and the practice of clinical genetics. Started by Dr Victor A. McKusick as the definitive reference Mendelian Inheritance in Man, OMIM (http://www.ncbi.nlm.nih.gov/omim/) is now distributed electronically by the National Center for Biotechnology Information, where it is integrated with the Entrez suite of databases. Derived from the biomedical literature, OMIM is written and edited at Johns Hopkins University with input from scientists and physicians around the world. Each OMIM entry has a full-text summary of a genetically determined phenotype and/or gene and has numerous links to other genetic databases such as DNA and protein sequence, PubMed references, general and locus-specific mutation databases, HUGO nomenclature, MapViewer, GeneTests, patient support groups and many others. OMIM is an easy and straightforward portal to the burgeoning information in human genetics.

2,715 citations

Journal Article
TL;DR: In this paper, the authors estimated the risks of breast and ovarian cancer from the occurrence of second cancers in individuals with breast cancer, and examined the risks of other cancers in BRCA1 carriers.

1,826 citations

Journal Article
TL;DR: The distribution of types of mutation in mendelian disease genes argues for serious consideration of the early application of a genomic-scale sequence-based approach to association studies and against complete reliance on a positional cloning approach based on a map of anonymous single nucleotide polymorphism haplotypes.
Abstract: The past two decades have witnessed an explosion in the identification, largely by positional cloning, of genes associated with mendelian diseases. The roughly 1,200 genes that have been characterized have clarified our understanding of the molecular basis of human genetic disease. The principles derived from these successes should be applied now to strategies aimed at finding the considerably more elusive genes that underlie complex disease phenotypes. The distribution of types of mutation in mendelian disease genes argues for serious consideration of the early application of a genomic-scale sequence-based approach to association studies and against complete reliance on a positional cloning approach based on a map of anonymous single nucleotide polymorphism haplotypes.

1,489 citations