Showing papers by "Tao Huang" published in 2021


Journal ArticleDOI
TL;DR: A comprehensive survey on the development and application of AI in different aspects of fundamental sciences, including information science, mathematics, medical science, materials science, geoscience, life science, physics and chemistry, is presented in this article.
Abstract: Artificial Intelligence (AI) coupled with promising machine learning (ML) techniques well known from computer science is broadly affecting many aspects of various fields including science and technology, industry, and even our day-to-day life. The ML techniques have been developed to analyze high-throughput data with a view to obtaining useful insights, categorizing, predicting and making evidence-based decisions in novel ways, which will promote the growth of novel applications and fuel the sustainable booming of AI. This paper undertakes a comprehensive survey on the development and application of AI in different aspects of fundamental sciences, including information science, mathematics, medical science, materials science, geoscience, life science, physics and chemistry. The challenges that each discipline of science meets, and the potentials of AI techniques to handle these challenges, are discussed in detail. Moreover, we shed light on new research trends entailing the integration of AI into each scientific discipline. The goal of this paper is to provide a broad research guideline on fundamental sciences with potential infusion of AI, to help motivate researchers to deeply understand the state-of-the-art applications of AI-based fundamental sciences, and thereby to help promote the continuous development of these fundamental sciences.

90 citations


Journal ArticleDOI
TL;DR: In this article, using recently reported transcriptomics data of upper airway tissue with acute respiratory illnesses, the authors integrated multiple machine learning methods to identify effective qualitative biomarkers and quantitative rules for distinguishing SARS-CoV-2 infection from other infectious diseases.
Abstract: The worldwide Coronavirus Disease 2019 (COVID-19) pandemic was triggered by the wide spread of a new coronavirus strain named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Multiple studies on the pathogenesis of SARS-CoV-2 were conducted immediately after the spread of the disease. However, the molecular pathogenesis of the virus and related diseases has still not been fully revealed. In this study, we attempted to identify new transcriptomic signatures as candidate diagnostic models for clinical testing or as therapeutic targets for vaccine design. Using the recently reported transcriptomics data of upper airway tissue with acute respiratory illnesses, we integrated multiple machine learning methods to identify effective qualitative biomarkers and quantitative rules for the distinction of SARS-CoV-2 infection from other infectious diseases. The transcriptomics data were first analyzed by Boruta to select important features, which were further evaluated by the minimum redundancy maximum relevance method, producing a feature list. This list was fed into incremental feature selection, incorporating several classification algorithms, to extract qualitative biomarker genes and construct quantitative rules. An efficient classifier was also built to identify patients infected with SARS-CoV-2. The findings reported in this study may help in revealing the potential pathogenic mechanisms of COVID-19 and finding new targets for vaccine design.

48 citations


Journal ArticleDOI
TL;DR: In this article, a rule-based computational method using Gene Ontology and KEGG pathway annotations of the participants in protein-protein interactions (PPIs) was proposed to identify a group of biological functions that are tightly associated with PPIs, providing a new function-based tool for PPI studies.

46 citations


Journal ArticleDOI
TL;DR: In this paper, an embedding-based method for predicting the subcellular localization of proteins is presented, motivated by the need to improve the protein representations used by existing methods.
Abstract: The functions of proteins are mainly determined by their subcellular localizations in cells. Currently, many computational methods for predicting the subcellular localization of proteins have been proposed. However, these methods require further improvement, especially when used in protein representations. In this study, we present an embedding-based method for predicting the subcellular localization of proteins. We first learn the functional embeddings of KEGG/GO terms, which are further used in representing proteins. Then, we characterize the network embeddings of proteins on a protein-protein network. The functional and network embeddings are combined as novel representations of protein locations for the construction of the final classification model. In our collected benchmark dataset with 4,861 proteins from 16 locations, the best model shows a Matthews correlation coefficient of 0.872 and is thus superior to multiple conventional methods.
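Because the benchmark has 16 locations, the Matthews correlation coefficient reported here is presumably a multiclass score. The abstract does not say which multiclass definition was used, so the following is a sketch that assumes the commonly used Gorodkin R_K generalization; the function name and the example labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (Gorodkin's R_K statistic)."""
    s = len(y_true)                                   # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly predicted samples
    t_k = Counter(y_true)                             # how often each class truly occurs
    p_k = Counter(y_pred)                             # how often each class is predicted
    labels = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in labels)
    denominator = sqrt(
        (s * s - sum(v * v for v in p_k.values()))
        * (s * s - sum(v * v for v in t_k.values()))
    )
    # By convention the score is 0 when the denominator vanishes.
    return numerator / denominator if denominator else 0.0


perfect = multiclass_mcc(["nucleus", "cytosol", "golgi"], ["nucleus", "cytosol", "golgi"])
partial = multiclass_mcc([1, 1, 0, 0], [1, 0, 0, 0])
```

For two classes this expression reduces to the familiar binary MCC, which makes it a convenient single implementation for both cases.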

35 citations


Journal ArticleDOI
TL;DR: A systematic review of studies on the epidemiology, immunological pathogenesis, molecular mechanisms, and structural biology, as well as approaches for drug or vaccine development for SARS-CoV-2 is provided in this article.
Abstract: COVID-19 has spread globally to over 200 countries with more than 40 million confirmed cases and one million deaths as of November 1, 2020. The SARS-CoV-2 virus, leading to COVID-19, shows extremely high rates of infectivity and replication, and can result in pneumonia, acute respiratory distress, or even mortality. SARS-CoV-2 has been found to continue to rapidly evolve, with several genomic variants emerging in different regions throughout the world. In addition, despite intensive study of the spike protein, its origin, and molecular mechanisms in mediating host invasion are still only partially resolved. Finally, the repertoire of drugs for COVID-19 treatment is still limited, with several candidates still under clinical trial and no effective therapeutic yet reported. Although vaccines based on either DNA/mRNA or protein have been deployed, their efficacy against emerging variants requires ongoing study, with multivalent vaccines supplanting the first-generation vaccines due to their low efficacy against new strains. Here, we provide a systematic review of studies on the epidemiology, immunological pathogenesis, molecular mechanisms, and structural biology, as well as approaches for drug or vaccine development for SARS-CoV-2.

33 citations


Journal ArticleDOI
TL;DR: In this paper, the authors identified specific regulatory factors and a series of rules that contribute to the activation and stimulation of airway smooth muscles by IL-13, IL-17, or the combination of both interleukins on the epigenetic and/or transcriptional levels.
Abstract: Smooth muscles are a specific muscle subtype that is widely identified in the tissues of internal passageways. This muscle subtype has the capacity for controlled or regulated contraction and relaxation. Airway smooth muscles are a unique type of smooth muscles that constitute the effective, adjustable, and reactive wall that covers most areas of the entire airway from the trachea to lung tissues. Infection with SARS-CoV-2, which caused the world-wide COVID-19 pandemic, involves airway smooth muscles and their surrounding inflammatory environment. Therefore, airway smooth muscles and related inflammatory factors may play an irreplaceable role in the initiation and progression of several severe diseases. Many previous studies have attempted to reveal the potential relationships between interleukins and airway smooth muscle cells only on the omics level, and the continued existence of numerous false-positive optimal genes/transcripts cannot reflect the actual effective biological mechanisms underlying interleukin-based activation effects on airway smooth muscles. Here, on the basis of newly presented machine learning-based computational approaches, we identified specific regulatory factors and a series of rules that contribute to the activation and stimulation of airway smooth muscles by IL-13, IL-17, or the combination of both interleukins on the epigenetic and/or transcriptional levels. The detected discriminative factors (genes) and rules can contribute to the identification of potential regulatory mechanisms linking airway smooth muscle tissues and inflammatory factors and help reveal specific pathological factors for diseases associated with airway smooth muscle inflammation on multiomics levels.

27 citations


Journal ArticleDOI
TL;DR: In this paper, a network embedding-based method, node2loc, was proposed to identify protein subcellular locations by taking protein-protein interactions (PPIs) into account.
Abstract: Identifying protein subcellular locations is an important topic in protein function prediction. Interacting proteins may share similar locations. Thus, it is imperative to infer protein subcellular locations by taking protein-protein interactions (PPIs) into account. In this study, we present a network embedding-based method, node2loc, to identify protein subcellular locations. node2loc first learns distributed embeddings of proteins in a protein-protein interaction (PPI) network using node2vec. Then the learned embeddings are further fed into a recurrent neural network (RNN). To resolve the severe class imbalance of different subcellular locations, Synthetic Minority Over-sampling Technique (SMOTE) is applied to artificially synthesize proteins for minority classes. node2loc is evaluated on our constructed human benchmark dataset with 16 subcellular locations and yields a Matthews correlation coefficient (MCC) value of 0.800, which is superior to baseline methods. In addition, node2loc yields a better performance on a yeast benchmark dataset with 17 locations. The results demonstrate that the learned representations from a PPI network have certain discriminative ability for classifying protein subcellular locations. However, node2loc is a transductive method: it only works for proteins connected in a PPI network and needs to be retrained for new proteins. In addition, the PPI network needs to be annotated to some extent with location information.
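The SMOTE step mentioned in this abstract synthesizes new minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea in plain Python; the function name, the neighbour count `k=2`, the fixed seed, and the two-dimensional toy "embeddings" are assumptions for illustration and do not reproduce the node2loc implementation.

```python
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic points by interpolating between minority samples and neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D embeddings of four proteins from an under-represented location.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6)
```

Every synthetic point lies on a line segment between two real minority samples, so oversampling adds density to the minority region without moving outside it.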

24 citations


Journal ArticleDOI
TL;DR: In this paper, the authors calculated the KEGG and Gene Ontology (GO) enrichment scores of all fibrotic disease genes and compared them with other genes using the Monte Carlo feature selection (MCFS) method.
Abstract: Acute and chronic inflammation often leads to fibrosis, which is also the common and final pathological outcome of chronic inflammatory diseases. To explore the common genes and pathogenic pathways among different fibrotic diseases, we collected all the reported genes of eight fibrotic diseases: eye fibrosis, heart fibrosis, hepatic fibrosis, intestinal fibrosis, lung fibrosis, pancreas fibrosis, renal fibrosis, and skin fibrosis. We calculated the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) enrichment scores of all fibrotic disease genes. Each gene was encoded using KEGG and GO enrichment scores, which reflected how much a gene can affect each function. For each fibrotic disease, the key KEGG and GO features were identified by comparing the KEGG and GO enrichment scores between reported disease genes and other genes using the Monte Carlo feature selection (MCFS) method. We compared the gene overlaps among the eight fibrotic diseases, and connective tissue growth factor (CTGF) was finally identified as the common key molecule. The key KEGG and GO features of the eight fibrotic diseases were all screened by the MCFS method. Interestingly, we found overlaps of pathways between renal fibrosis and skin fibrosis, such as GO:1901890-positive regulation of cell junction assembly, as well as common regulatory genes, such as CTGF, which is the key molecule regulating fibrogenesis. We hope to offer new insight into the cellular and molecular mechanisms underlying fibrosis and thereby help lead to the development of new drugs that specifically delay or even improve the symptoms of fibrosis.
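The abstract does not define its enrichment score. One common definition, assumed here purely for illustration, is the negative base-10 logarithm of a hypergeometric tail probability for the overlap between a gene set and the genes annotated to a GO term or KEGG pathway. The function name, the parameter names, and the example counts below are hypothetical.

```python
from math import comb, inf, log10


def enrichment_score(N, M, n, m):
    """Return -log10 of the hypergeometric tail probability P(X >= m).

    N: genes in the background
    M: background genes annotated to the GO term or KEGG pathway
    n: genes in the query set
    m: query genes annotated to the term
    """
    tail = sum(comb(M, k) * comb(N - M, n - k) for k in range(m, min(M, n) + 1))
    p_value = tail / comb(N, n)
    if p_value == 0:  # an overlap of m is impossible with these counts
        return inf
    return -log10(p_value)


# Hypothetical counts: 20,000 background genes, 150 in the pathway,
# a query set of 300 genes of which 12 fall in the pathway.
score = enrichment_score(N=20000, M=150, n=300, m=12)
```

A larger score means the overlap is less likely under random sampling; a vector of such scores across all terms is one way to encode a gene as a numeric feature vector for feature selection.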

20 citations


Journal ArticleDOI
TL;DR: In this paper, the authors applied machine learning models to identify specific host biomarkers associated with COVID-19 infection on the basis of a publicly released transcriptomic dataset, which included healthy controls and patients with bacterial infection, influenza, COVID-19, and other kinds of coronavirus.
Abstract: COVID-19, a severe respiratory disease caused by a new type of coronavirus, SARS-CoV-2, has been spreading all over the world. Some patients infected with SARS-CoV-2 have no symptoms, i.e., they are presymptomatic or asymptomatic. Both groups can further spread the virus to other susceptible people, thereby making the control of COVID-19 difficult. The two major challenges for COVID-19 diagnosis at present are as follows: (1) patients could share similar symptoms with other respiratory infections, and (2) patients may not have any symptoms but could still spread the virus. Therefore, new biomarkers at different omics levels are required for the large-scale screening and diagnosis of COVID-19. Although some initial analyses identified a group of candidate gene biomarkers for COVID-19, the previous work could not identify biomarkers suitable for clinical use, which requires a diagnosis that distinguishes COVID-19 from multiple other infectious diseases. As an extension of the previous study, optimized machine learning models were applied in the present study to identify some specific qualitative host biomarkers associated with COVID-19 infection on the basis of a publicly released transcriptomic dataset, which included healthy controls and patients with bacterial infection, influenza, COVID-19, and other kinds of coronavirus. This dataset was first analyzed by the Boruta and Max-Relevance and Min-Redundancy feature selection methods one after the other, resulting in a feature list. This list was fed into the incremental feature selection method, incorporating one of the classification algorithms, to extract essential biomarkers and build efficient classifiers and classification rules. The capacity of these findings to distinguish COVID-19 from other similar respiratory infectious diseases at the transcriptomic level was also validated, which may improve the efficacy and accuracy of COVID-19 diagnosis.

17 citations


Journal ArticleDOI
TL;DR: In this article, the mediating effect of stress on the association between physical activity and sleep quality in Chinese college students, after controlling for age, nationality, and tobacco and alcohol use, was investigated.
Abstract: Background: While physical activity has been reported to positively affect stress and sleep quality, less is known about the potential relationships among them. The present study aimed to investigate the mediating effect of stress on the association between physical activity and sleep quality in Chinese college students, after controlling for age, nationality, and tobacco and alcohol use. Participants: The sample comprised 6973 college students representing three Chinese universities. Methods: Physical activity, perceived stress, and sleep quality were respectively measured using the International Physical Activity Questionnaire—Short Form (IPAQ-SF), Perceived Stress Scale—10 Items (PSS-10), and Pittsburgh Sleep Quality Index (PSQI). Results: Mediating effects of perceived stress on the association between physical activity and sleep quality were observed in males and females, with 42.4% (partial mediating effect) and 306.3% (complete mediating effect) as percentages of mediation, respectively. Conclusion: The results of this study may provide some suggestions that physical activity could improve sleep by aiding individuals in coping with stress and indicate that stress management might be an effective non-pharmaceutical therapy for sleep improvement.

16 citations


Journal ArticleDOI
TL;DR: Wang et al. applied several advanced computational methods to analyze Cancer Cell Line Encyclopedia (CCLE) gene expression profiles covering 988 cell lines from 20 cancer types; two feature selection methods, Boruta and max-relevance and min-redundancy, were applied to the cell line gene expression data one after the other to generate a feature list.
Abstract: There are many types of cancers. Although they share some hallmarks, such as proliferation and metastasis, they are still very different from many perspectives. They grow on different organs or tissues. Does each cancer have a unique gene expression pattern that makes it different from other cancer types? After the Cancer Genome Atlas (TCGA) project, there are more and more pan-cancer studies. Researchers want to obtain robust gene expression signatures from pan-cancer patients. But there is large variance in cancer patients due to heterogeneity. To get robust results, the sample size will be too large to recruit. In this study, we tried another approach to get robust pan-cancer biomarkers by using cell line data to reduce the variance. We applied several advanced computational methods to analyze the Cancer Cell Line Encyclopedia (CCLE) gene expression profiles, which included 988 cell lines from 20 cancer types. Two feature selection methods, Boruta and max-relevance and min-redundancy, were applied to the cell line gene expression data one after the other, generating a feature list. This list was fed into the incremental feature selection method, incorporating one classification algorithm, to extract biomarkers and construct optimal classifiers and decision rules. The optimal classifiers provided good performance and can be useful tools to identify cell lines from different cancer types, whereas the biomarkers (e.g. NCKAP1, TNFRSF12A, LAMB2, FKBP9, PFN2, TOM1L1) and rules identified in this work may provide a meaningful and precise reference for differentiating multiple types of cancer and contribute to the personalized treatment of tumors.

Journal ArticleDOI
TL;DR: In this article, the extracellular microRNA profiles on 11 cancer types and non-cancer were first analyzed by Boruta to extract important microRNAs, which were then evaluated by the Max-Relevance and Min-Redundancy feature selection method.
Abstract: Cancer is one of the most threatening diseases to humans. It can invade multiple significant organs, including lung, liver, stomach, pancreas, and even brain. The identification of cancer biomarkers is one of the most significant components of cancer studies as the foundation of clinical cancer diagnosis and related drug development. During the large-scale screening for cancer prevention and early diagnosis, obtaining cancer-related tissues is impossible. Thus, the identification of cancer-associated circulating biomarkers from liquid biopsy targeting has been proposed and has become the most important direction for research on clinical cancer diagnosis. Here, we analyzed pan-cancer extracellular microRNA profiles by using multiple machine-learning models. The extracellular microRNA profiles on 11 cancer types and non-cancer were first analyzed by Boruta to extract important microRNAs. Selected microRNAs were then evaluated by the Max-Relevance and Min-Redundancy feature selection method, resulting in a feature list, which was fed into the incremental feature selection method to identify candidate circulating extracellular microRNAs for cancer recognition and classification. A series of quantitative classification rules was also established for such cancer classification, thereby providing a solid research foundation for further biomarker exploration and functional analyses of tumorigenesis at the level of circulating extracellular microRNA.

Journal ArticleDOI
TL;DR: The SVM classifier may serve as an important clinical tool to address the challenging task of differentiating between CHP and IPF and many of the biomarker genes on the differential co-expression network showed great promise in revealing the underlying mechanisms of CHP.
Abstract: Aims: We would like to identify the biomarkers for chronic hypersensitivity pneumonitis (CHP) and facilitate the precise gene therapy of CHP. Background: Chronic hypersensitivity pneumonitis (CHP) is an interstitial lung disease caused by hypersensitive reactions to inhaled antigens. Clinically, differentiating between CHP and other interstitial lung diseases, especially idiopathic pulmonary fibrosis (IPF), is challenging. Objective: In this study, we analyzed the publicly available gene expression profiles of 82 CHP patients, 103 IPF patients, and 103 control samples to identify the CHP biomarkers. Method: The CHP biomarkers were selected with advanced feature selection methods: Monte Carlo Feature Selection (MCFS) and Incremental Feature Selection (IFS). A Support Vector Machine (SVM) classifier was built. Then, we analyzed these CHP biomarkers through functional enrichment analysis and differential co-expression analysis. Result: There were 674 identified CHP biomarkers. The co-expression network of these biomarkers in CHP included more negative regulations, and the network structure of CHP was quite different from the networks of IPF and control. Conclusion: The SVM classifier may serve as an important clinical tool to address the challenging task of differentiating between CHP and IPF. Many of the biomarker genes on the differential co-expression network showed great promise in revealing the underlying mechanisms of CHP.

Journal ArticleDOI
TL;DR: Zhang et al. extracted functional enrichment features from GO and KEGG, and then used node2vec to learn functional embedding features of genes from a gene-gene network.
Abstract: Phenotype is one of the most significant concepts in genetics, which is used to describe all the characteristics of a research object that can be observed. Considering that phenotype reflects the integrated features of genotype and environment factors, it is hard to define phenotype characteristics and even harder to predict unknown phenotypes. Restricted by current biological techniques, it is still quite expensive and time-consuming to obtain sufficient structural information of large-scale phenotype-associated genes/proteins. Various bioinformatics methods have been presented to solve this problem, and researchers have confirmed the efficacy and prediction accuracy of functional network-based prediction. But general functional descriptions have highly complicated inner structures for phenotype prediction. To further address this issue and improve the efficacy of phenotype prediction on more than ten kinds of phenotypes, we first extract functional enrichment features from GO and KEGG, and then use node2vec to learn functional embedding features of genes from a gene-gene network. All these features are analyzed by feature selection methods (Boruta, minimum redundancy maximum relevance) to generate a feature list. This list is fed into incremental feature selection, incorporating multi-label classifiers built by RAkEL and some classic base classifiers, to build an optimum multi-label multi-class classification model for phenotype prediction. According to recent research, our method has indeed identified many literature-supported genes/proteins and their associated phenotypes, and even some candidate genes with re-assigned new phenotypes, providing a new computational tool for accurate and effective phenotypic prediction.

Journal ArticleDOI
TL;DR: In this article, the role of SNHG8 in EBV-associated gastric cancer (EBVaGC) was investigated, and the relationship between the expression levels of small nucleolar RNA host genes and clinical outcome in 61 EBVaGC cases was analyzed.
Abstract: SNHG8, a member of the small nucleolar RNA host gene (SNHG) family, has been reported to act as an oncogene in gastric carcinoma (GC). However, its biological function in Epstein-Barr virus (EBV)-associated gastric cancer (EBVaGC) remains unclear. This study investigated the role of SNHG8 in EBVaGC. Sixty-one cases of EBVaGC, 20 cases of non-EBV-infected gastric cancer (EBVnGC), and related cell lines were studied for the expression of SNHG8 and BHRF1 (BCL2 homolog reading frame 1) encoded by EBV with Western blot and qRT-PCR assays. The relationship between the expression levels of SNHG8 and the clinical outcome in 61 EBVaGC cases was analyzed. Effects of overexpression or knockdown of BHRF1, SNHG8, or TRIM28 on cell proliferation, migration, invasion, and cell cycle and the related molecules were determined by several assays, including cell proliferation, colony assay, wound healing assay, transwell invasion assay, cell cycle analysis with flow cytometry, qRT-PCR, and Western blot for expression levels. The interactions among SNHG8, miR-512-5p, and TRIM28 were determined with Luciferase reporter assay, RNA immunoprecipitation (RIP), pull-down assays, and Western blot assay. The in vivo activity of SNHG8 was assessed with SNHG8 knockdown tumor xenografts in zebrafish. Results demonstrated the following. (1) BHRF1 and SNHG8 were overexpressed in EBV-encoded RNA 1-positive EBVaGC tissues and cell lines. BHRF1 upregulated the expressions of SNHG8 and TRIM28 in AGS. (2) SNHG8 overexpression had a significant correlation with tumor size and vascular tumor thrombus. Patients with high SNHG8 expression had poorer overall survival (OS) compared to those with low SNHG8 expression. (3) SNHG8 overexpression promoted EBVaGC cell proliferation, migration, and invasion in vitro and in vivo, as well as cell cycle arrest at the G2/M phase, via the activation of BCL-2, CCND1, PCNA, PARP1, CDH1, CDH2, VIM, and Snail. (4) Results of dual-luciferase reporter assay, RNA immunoprecipitation, and pull-down assays indicated that SNHG8 sponged miR-512-5p, which targeted TRIM28 and promoted cancer malignant behaviors of EBVaGC cells. Our data suggest that BHRF1 triggered the expression of SNHG8, which sponged miR-512-5p and upregulated TRIM28 and a set of effectors (such as BCL-2, CCND1, CDH1, CDH2, Snail, and VIM) to promote EBVaGC tumorigenesis and invasion. SNHG8 could be an independent prognostic factor for EBVaGC and serve as a target for EBVaGC therapy.

Journal ArticleDOI
09 Sep 2021-Life
TL;DR: In this article, a computational method was proposed for distinguishing cell subtypes from the different pathological regions of non-small cell lung cancer on the basis of transcriptomic profiles, including a group of qualitative classification criteria (biomarkers) and various rules.
Abstract: Non-small cell lung cancer is a major lethal subtype of epithelial lung cancer, with high morbidity and mortality. The single-cell sequencing technique plays a key role in exploring the pathogenesis of non-small cell lung cancer. We proposed a computational method for distinguishing cell subtypes from the different pathological regions of non-small cell lung cancer on the basis of transcriptomic profiles, including a group of qualitative classification criteria (biomarkers) and various rules. The random forest classifier reached a Matthews correlation coefficient (MCC) of 0.922 by using 720 features, and the decision tree reached an MCC of 0.786 by using 1880 features. The obtained biomarkers and rules were analyzed at the end of this study.

Journal ArticleDOI
TL;DR: Wang et al. employed weighted gene co-expression network analysis (WGCNA) to build a gene interaction network using the expression profile of lung adenocarcinoma (LUAD) from The Cancer Genome Atlas (TCGA).
Abstract: Background: With the advent of large-scale molecular profiling, an increasing number of oncogenic drivers contributing to precision medicine and reshaping the classification of lung adenocarcinoma (LUAD) have been identified. However, only a minority of patients achieved improved outcomes under current standard therapies because of the dynamic mutational spectrum, which requires expanding susceptible gene libraries. Accumulating evidence has shown that understanding gene regulatory networks as well as their changing processes is helpful in identifying core genes that act as master regulators during carcinogenesis. The present study aimed at identifying key genes with differential correlations between normal and tumor status. Methods: Weighted gene co-expression network analysis (WGCNA) was employed to build a gene interaction network using the expression profile of LUAD from The Cancer Genome Atlas (TCGA). The R package DiffCorr was used to identify differential correlations between tumor and adjacent normal tissues. STRING and Cytoscape were used for the construction and visualization of biological networks. Results: A total of 176 modules were detected in the network, among which the yellow and medium orchid modules showed the most significant associations with LUAD. Genes in these two modules were then chosen to evaluate their differential correlations. Finally, dozens of novel genes with opposite correlations, including ATP13A4-AS1, HIGD1B, DAP3, and ISG20L2, were identified. Further biological and survival analyses highlighted their potential value in the diagnosis and treatment of LUAD. Moreover, real-time qPCR confirmed the expression patterns of ATP13A4-AS1, HIGD1B, DAP3, and ISG20L2 in LUAD tissues and cell lines. Conclusion: Our study provided new insights into the gene regulatory mechanisms during the transition from normal to tumor, pioneering a network-based algorithm in the application of tumor etiology.
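A differential correlation between tumor and normal tissue is commonly tested by applying Fisher's z-transformation to the two correlation coefficients and comparing them with a normal test. The sketch below illustrates that test in Python under this assumption; it is not the DiffCorr R API, and the function names, sample sizes, and correlation values are hypothetical.

```python
from math import atanh, erfc, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def differential_correlation(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two correlation coefficients."""
    z1, z2 = atanh(r1), atanh(r2)            # Fisher z-transform of each coefficient
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # standard error of the difference
    z = (z1 - z2) / se
    p_value = erfc(abs(z) / sqrt(2))         # two-sided tail of the standard normal
    return z, p_value


# Hypothetical gene pair: positively correlated in normal tissue (r = 0.8, n = 50)
# and negatively correlated in tumor tissue (r = -0.6, n = 50).
z, p = differential_correlation(0.8, 50, -0.6, 50)
```

A gene pair whose correlation flips sign between conditions, as in this toy example, gives a large |z| and a small p-value, which is the pattern the abstract calls "opposite correlations".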

Journal ArticleDOI
TL;DR: Wang et al. applied the Max-Relevance and Min-Redundancy (mRMR) feature selection method to each dataset and fed the resulting feature list into incremental feature selection (IFS), with a support vector machine (SVM) as the classification algorithm.
Abstract: Type 2 diabetes (T2D) is a systematic chronic metabolic condition with abnormal sugar metabolism dysfunction, and its complications are the most harmful to human beings and may be life-threatening after long-term durations. Considering the high incidence and severity at late stage, researchers have been focusing on the identification of specific biomarkers and potential drug targets for T2D at the genomic, epigenomic, and transcriptomic levels. Microbes participate in the pathogenesis of multiple metabolic diseases including diabetes. However, the related studies are still non-systematic and lack the functional exploration on identified microbes. To fill this gap between gut microbiome and diabetes study, we first introduced eggNOG database and KEGG ORTHOLOGY (KO) database for orthologous (protein/gene) annotation of microbiota. Two datasets with these annotations were employed, which were analyzed by multiple machine-learning models for identifying significant microbiota biomarkers of T2D. The powerful feature selection method, Max-Relevance and Min-Redundancy (mRMR), was first applied to the datasets, resulting in a feature list for each dataset. Then, the list was fed into the incremental feature selection (IFS), incorporating support vector machine (SVM) as the classification algorithm, to extract essential annotations and build efficient classifiers. This study not only revealed potential pathological factors for diabetes at the microbiome level but also provided us new candidates for drug development against diabetes.
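The mRMR ranking described in this abstract can be sketched as a greedy procedure: at each step, pick the feature whose mutual information with the class label (relevance) most exceeds its average mutual information with the features already chosen (redundancy). The version below assumes discrete features and uses the difference form of the criterion; the feature names and toy values are hypothetical, and continuous abundance data would need to be discretized first.

```python
from collections import Counter
from math import log2


def mutual_information(a, b):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log2(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items()
    )


def mrmr_rank(features, target):
    """Greedy mRMR: repeatedly pick the feature with max relevance minus mean redundancy."""
    remaining = set(features)
    ranked = []
    while remaining:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not ranked:
                return relevance
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in ranked
            ) / len(ranked)
            return relevance - redundancy
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Hypothetical binarized annotations for eight samples (0 = control, 1 = T2D).
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],      # informative
    "g1_dup": [0, 0, 0, 1, 1, 1, 1, 1],  # exact copy of g1, fully redundant
    "g2": [0, 0, 1, 0, 1, 1, 0, 1],      # weaker but complementary
}
ranking = mrmr_rank(features, target)
```

In this toy example the duplicated feature is pushed to the end of the ranking even though it is as relevant as the first pick, which is the behaviour that distinguishes mRMR from ranking by relevance alone. The resulting list is the kind of input IFS consumes.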

Journal ArticleDOI
TL;DR: The authors' discovered signatures may provide an effective and precise transcriptomic reference to monitor EMT progression at the single-cell level and contribute to the exploration of detailed tumorigenesis mechanisms during EMT.
Abstract: Cancer, which refers to abnormal cell proliferative diseases with systematic pathogenic potential, is one of the leading threats to human health. The final causes for patients' deaths are usually cancer recurrence, metastasis, and drug resistance against continuing therapy. Epithelial-to-mesenchymal transition (EMT), which is the transformation of tumor cells (TCs), is a prerequisite for pathogenic cancer recurrence, metastasis, and drug resistance. Conventional biomarkers can only define and recognize large tissues with obvious EMT markers but cannot accurately monitor detailed EMT processes. In this study, a systematic workflow was established integrating effective feature selection, multiple machine learning models [Random forest (RF), Support vector machine (SVM)], rule learning, and functional enrichment analyses to find new biomarkers and their functional implications for distinguishing single-cell isolated TCs with unique epithelial or mesenchymal markers using public single-cell expression profiling. Our discovered signatures may provide an effective and precise transcriptomic reference to monitor EMT progression at the single-cell level and contribute to the exploration of detailed tumorigenesis mechanisms during EMT.

Journal ArticleDOI
TL;DR: In this paper, the role of microRNAs and circular RNAs in Nasopharyngeal Carcinoma (NPC) was investigated by constructing a circRNA-miRNA-mRNA co-expression network and performing differential expression analysis on mRNAs, miRNAs, and circRNAs.
Abstract: Non-coding RNAs have been shown to play important regulatory roles, notably in cancer development. In this study, we investigated the role of microRNAs and circular RNAs in Nasopharyngeal Carcinoma (NPC) by constructing a circRNA-miRNA-mRNA co-expression network and performing differential expression analysis on mRNAs, miRNAs, and circRNAs. Specifically, Epstein-Barr virus (EBV) infection has been found to be an important risk factor for NPC, and potential pathological differences may exist between the EBV+ and EBV- subtypes of NPC. By comparing the expression profiles of a non-cancerous immortalized nasopharyngeal epithelial cell line and NPC cell lines, we identified differentially expressed coding and non-coding RNAs across three groups of comparison: cancer vs. non-cancer, EBV+ vs. EBV- NPC, and metastatic vs. non-metastatic NPC. We constructed a ceRNA network composed of mRNAs, miRNAs, and circRNAs, leveraging co-expression and miRNA target prediction tools. Within the network, we identified the regulatory ceRNAs of CDKN1B, ZNF302, ZNF268, and RPGR. These differentially expressed axes, along with other miRNA-circRNA pairs we identified through our analysis, help elucidate the genetic and epigenetic changes central to NPC progression, and the differences between EBV+ and EBV- NPC.

Journal ArticleDOI
TL;DR: In this article, the authors used the protein-protein interaction network, functional annotation of proteins and a group of direct proteins with known subcellular localization to construct models, which can help promote the development of predictive technologies on subcellular localizations and provide a new approach for exploring the protein subcellular localization patterns and their potential biological importance.
Abstract: Given the limitations of current technologies, the subcellular localizations of proteins are difficult to identify. Predicting the subcellular localization and the intercellular distribution patterns of proteins in accordance with their specific biological roles, including validated functions, relationships with other proteins, and even their specific sequence characteristics, is necessary. The computational prediction of protein subcellular localizations can be performed on the basis of the sequence and the functional characteristics. In this study, the protein-protein interaction network, functional annotation of proteins and a group of direct proteins with known subcellular localization were used to construct models. To build efficient models, several powerful machine learning algorithms, including two feature selection methods and four classification algorithms, were employed. Some key proteins and functional terms were discovered, which may provide important contributions for determining protein subcellular locations. Furthermore, some quantitative rules were established to identify the potential subcellular localizations of proteins. As the first prediction model that uses direct protein annotation information (i.e., functional features) and STRING-based protein-protein interaction network (i.e., network features), our computational model can help promote the development of predictive technologies on subcellular localizations and provide a new approach for exploring the protein subcellular localization patterns and their potential biological importance.

Journal ArticleDOI
03 Jun 2021-Life
TL;DR: In this paper, a computational engine was developed to predict the features of antifreeze proteins and reveal the 39 most important features for AFP identification, such as antifreeze-like/N-acetylneuraminic acid synthase C-terminal, insect AFP motif, C-type lectin-like, and EGF-like domain.
Abstract: Antifreeze protein (AFP) is a proteinaceous compound with improved antifreeze ability and the ability to bind to ice and prevent its growth. As a surface-active material, a small number of AFPs have a tremendous influence on the growth of ice. Therefore, identifying novel AFPs is important to understand protein–ice interactions and create novel ice-binding domains. To date, predicting AFPs is difficult due to their low sequence similarity for the ice-binding domain and the lack of common features among different AFPs. Here, a computational engine was developed to predict the features of AFPs and reveal the 39 most important features for AFP identification, such as antifreeze-like/N-acetylneuraminic acid synthase C-terminal, insect AFP motif, C-type lectin-like, and EGF-like domain. With this newly presented computational method, a group of previously confirmed functional AFP motifs was screened out. This study has identified some potential new AFP motifs and contributes to understanding biological antifreeze mechanisms.


Journal ArticleDOI
22 Apr 2021-PLOS ONE
TL;DR: In this article, the authors used gene methylation and expression features together and screened out the optimal features, including gene expression or methylation signatures, for fetal intolerance prediction for the first time.
Abstract: Pregnancy is a long and complicated process during which one or more offspring develop inside a woman. A short period of oxygen shortage after birth is quite normal for most babies and does not threaten their health. However, if babies have to suffer from a long period of oxygen shortage, then this condition is an indication of pathological fetal intolerance, which probably causes their death. Distinguishing pathological fetal intolerance from physical oxygen shortage has long been one of the important clinical problems in obstetrics. The clinical syndromes typically manifest five symptoms that indicate that the baby may suffer from fetal intolerance. At present, with the rapid development of molecular sequencing technologies, liquid biopsy combined with high-throughput sequencing or mass spectrometry techniques provides a quick approach to detect real-time alterations in the peripheral blood at multiple levels. Gene methylation is functionally correlated with gene expression; thus, the combination of gene methylation and expression information would help in screening out the key regulators for the pathogenesis of fetal intolerance. We combined gene methylation and expression features together and screened out the optimal features, including gene expression or methylation signatures, for fetal intolerance prediction for the first time. In addition, we applied various computational methods to construct a comprehensive computational pipeline to identify the potential biomarkers for fetal intolerance based on liquid biopsy samples. We set up qualitative and quantitative computational models for the prediction of fetal intolerance during pregnancy. Moreover, we provided a new perspective on the detailed pathological mechanism of fetal intolerance. This work can provide a solid foundation for further experimental research and contribute to the application of liquid biopsy in antenatal care.

Journal ArticleDOI
TL;DR: A novel computational method is presented for the identification of the applicable and substantial blood gene signatures of IFX sensitivity by liquid biopsy, which may assist in the establishment of a clinical drug sensitivity test standard for RA and contribute to the revelation of unique IFX-associated pharmacological mechanisms.
Abstract: Rheumatoid arthritis (RA) is a severe chronic pathogenic inflammatory abnormality that damages small joints. Comprehensive diagnosis and treatment procedures for RA have been established because of its severe symptoms and relatively high morbidity. Medication and surgery are the two major therapeutic approaches. Infliximab (IFX) is a novel biological agent applied for the treatment of RA. IFX improves physical functions and benefits the achievement of clinical remission even under discontinuous medication. However, not all patients react to IFX, and distinguishing IFX-sensitive and IFX-resistant patients is quite difficult. Thus, how to predict the therapeutic effects of IFX on patients with RA is one of the urgent translational medicine problems in the clinical treatment of RA. In this study, we present a novel computational method for the identification of the applicable and substantial blood gene signatures of IFX sensitivity by liquid biopsy, which may assist in the establishment of a clinical drug sensitivity test standard for RA and contribute to the revelation of unique IFX-associated pharmacological mechanisms.


Journal ArticleDOI
TL;DR: This work provides a novel computational tool for immune cell quantitative subtyping and biomarker recognition by identifying different immune cell subtypes from gene expression profiling data of the Immunological Genome Project (ImmGen).
Abstract: The immune system is a complicated defensive system that comprises multiple functional cells and molecules acting against endogenous and exogenous pathogenic factors. Identifying immune cell subtypes and recognizing their unique immunological functions are difficult because of the complicated cellular components and immunological functions of the immune system. With the development of transcriptomics and high-throughput sequencing, the gene expression profiling of immune cells can provide a new strategy to explore immune cell subtyping. On the basis of the new profiling data of mouse immune cell gene expression from the Immunological Genome Project (ImmGen), a novel computational pipeline was applied to identify different immune cell subtypes, including αβ T cells, B cells, γδ T cells, and innate lymphocytes. First, the profiling data were analyzed by a powerful feature selection method, Monte-Carlo Feature Selection, resulting in a feature list and some informative features. For the list, the two-stage incremental feature selection method, incorporating random forest as the classification algorithm, was applied to extract essential gene signatures and build an efficient classifier. In parallel, a rule learning scheme was applied to the informative features to construct quantitative expression rules. A group of gene signatures was found to be qualitatively related to the biological processes of the four immune cell subtypes. The quantitative expression rules can efficiently cluster immune cells. This work provides a novel computational tool for immune cell quantitative subtyping and biomarker recognition.