scispace - formally typeset
Author

Yuran Jia

Bio: Yuran Jia is an academic researcher. The author has contributed to research in the topics Computer science and Computational biology. The author has co-authored 1 publication.

Papers
Journal ArticleDOI
TL;DR: This paper develops a DNA-binding protein identification method called KK-DBP, which fuses multiple PSSM features to improve prediction accuracy and achieves a prediction accuracy of 81.22% on the independent test dataset PDB186.
Abstract: A DNA-binding protein (DBP) is a protein with a special DNA binding domain that is associated with many important molecular biological mechanisms. Rapid development of computational methods has made it possible to predict DBPs on a large scale; however, existing methods do not fully integrate DBP-related features, resulting in rough prediction results. In this article, we develop a DNA-binding protein identification method called KK-DBP. To improve prediction accuracy, we propose a feature extraction method that fuses multiple PSSM features. The experimental results show a prediction accuracy on the independent test dataset PDB186 of 81.22%, which is the highest of all existing methods.

5 citations

Journal ArticleDOI
TL;DR: A variety of computational methods for complex genome assembly are reviewed and it is hoped that gapless, telomere-to-telomere and accurate assembly of complex genomes can be truly routinely achieved using only a simple process or a single tool in the future.
Abstract: High-quality chromosome-scale genome sequences provide an important basis for downstream genomic analysis, especially the construction of haplotype-resolved and complete genomes, which plays a key role in genome annotation, mutation detection, evolutionary analysis, gene function research, comparative genomics and other areas. However, it is difficult for genome-wide short-read sequencing to produce a complete genome when faced with a complex genome with high duplication and multiple heterozygosity. The emergence of long-read sequencing technology has greatly improved the integrity of complex genome assembly. We review a variety of computational methods for complex genome assembly and describe in detail the theories, innovations and shortcomings of collapsed, semi-collapsed and uncollapsed assemblers based on long reads. Among the three methods, uncollapsed assembly is the most correct and complete way to represent genomes. In addition, genome assembly is closely related to haplotype reconstruction: uncollapsed assembly realizes haplotype reconstruction, and haplotype reconstruction promotes uncollapsed assembly. We hope that gapless, telomere-to-telomere and accurate assembly of complex genomes can be truly routinely achieved using only a simple process or a single tool in the future.

4 citations

Journal ArticleDOI
TL;DR: A novel stacking-based ensemble learning framework for Cas protein identification, called CRISPRCasStack, which can address the accuracy deficiencies and inefficiencies of the existing state-of-the-art tools.
Abstract: The CRISPR-Cas system is an adaptive immune system widely found in most bacteria and archaea to defend against exogenous gene invasion. One of the most critical steps in exploring and classifying novel CRISPR-Cas systems and their functional diversity is the identification of Cas proteins in CRISPR-Cas systems. The discovery of novel Cas proteins has also laid the foundation for technologies such as CRISPR-Cas-based gene editing and gene therapy. Currently, accurate and efficient screening of Cas proteins from metagenomic sequences and proteomic sequences remains a challenge. For Cas proteins with low sequence conservation, existing homology-based tools for Cas protein identification cannot guarantee identification accuracy and efficiency. In this paper, we have developed a novel stacking-based ensemble learning framework for Cas protein identification, called CRISPRCasStack. In particular, we applied the SHAP (SHapley Additive exPlanations) method to analyze the features used in CRISPRCasStack. Sufficient experimental validation and independent testing have demonstrated that CRISPRCasStack can address the accuracy deficiencies and inefficiencies of the existing state-of-the-art tools. We also provide a toolkit to accurately identify and analyze potential Cas proteins, Cas operons, CRISPR arrays and CRISPR-Cas loci in prokaryotic sequences. The CRISPRCasStack toolkit is available at https://github.com/yrjia1015/CRISPRCasStack.

1 citation


Cited by
Journal ArticleDOI
TL;DR: This work proposes a methodology named “DNAPred_Prot”, which uses various position and frequency-dependent features from protein sequences for efficient and effective prediction of DNA-binding proteins, and the results suggest that the methodology performs better than other extant methods.
Abstract: In the domain of genome annotation, the identification of DNA-binding proteins is one of the crucial challenges. DNA is considered a blueprint for the cell. It contains all the information necessary for building and maintaining the traits of an organism. It is DNA that makes a living thing a living thing. Protein interaction with DNA plays an essential role in regulating DNA functions such as DNA repair, transcription, and regulation. Identification of these proteins is a crucial task for understanding the regulation of genes. Several methods have been developed to identify the binding sites of DNA and protein depending upon the structures and sequences, but they were costly and time-consuming. Therefore, we propose a methodology named “DNAPred_Prot”, which uses various position and frequency-dependent features from protein sequences for efficient and effective prediction of DNA-binding proteins. Using testing techniques like 10-fold cross-validation and jackknife testing, accuracies of 94.95% and 95.11% were obtained, respectively. The results of SVM and ANN were also compared with those of a random forest classifier. The robustness of the proposed model was evaluated using the independent dataset PDB186, on which it achieved an accuracy of 91.47%. From these results, it can be inferred that the suggested methodology performs better than other extant methods for the identification of DNA-binding proteins.
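The two testing schemes named in the abstract above differ only in how samples are partitioned. As a minimal sketch (not the authors' code), the following shows how 10-fold cross-validation and jackknife (leave-one-out) index splits can be generated. The function names `k_fold_splits` and `jackknife_splits` are hypothetical, and the sketch assumes the samples have already been shuffled, so folds are taken as contiguous index ranges.

```python
from typing import Iterator, List, Tuple


def k_fold_splits(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous folds.

    Assumption: the caller has already shuffled the data, so contiguous
    index ranges are acceptable as folds.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def jackknife_splits(n_samples: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Jackknife testing: every sample is held out exactly once (k == n_samples)."""
    return k_fold_splits(n_samples, n_samples)
```

In either scheme a model would be trained on each `train` list and scored on the matching `test` list, with the reported accuracy being the average over all splits.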

5 citations

Journal ArticleDOI
TL;DR: This study developed a comprehensive computational model for identifying plant-specific DNA-binding proteins (DBPs). Five shallow learning and six deep learning models were initially used for prediction, and the shallow learning methods outperformed the deep learning algorithms.
Abstract: DNA-binding proteins (DBPs) play crucial roles in numerous cellular processes including nucleotide recognition, transcriptional control and the regulation of gene expression. The majority of the existing computational techniques for identifying DBPs are mainly applicable to human and mouse datasets. Even though some models have been tested on Arabidopsis, they produce poor accuracy when applied to other plant species. Therefore, it is imperative to develop an effective computational model for predicting plant DBPs. In this study, we developed a comprehensive computational model for plant-specific DBP identification. Five shallow learning and six deep learning models were initially used for prediction, where shallow learning methods outperformed deep learning algorithms. In particular, the support vector machine achieved the highest repeated 5-fold cross-validation performance, with 94.0% area under the receiver operating characteristic curve (AUC-ROC) and 93.5% area under the precision-recall curve (AUC-PR). With an independent dataset, the developed approach secured 93.8% AUC-ROC and 94.6% AUC-PR. When compared with the state-of-the-art existing tools using an independent dataset, the proposed model achieved much higher accuracy. Overall results suggest that the developed computational model is more efficient and reliable than the existing models for the prediction of DBPs in plants. For the convenience of the majority of experimental scientists, the developed prediction server PlDBPred is publicly accessible at https://iasri-sg.icar.gov.in/pldbpred/. The source code is also provided at https://iasri-sg.icar.gov.in/pldbpred/source_code.php for prediction using a large-size dataset.

3 citations

Posted ContentDOI
01 Aug 2021-bioRxiv
TL;DR: ILRA is a pipeline that orders, names, merges, and circularizes contigs, filters erroneous small contigs and contamination, and corrects homopolymer errors with Illumina reads.
Abstract: Recent advances in long read technologies not only enable large consortia to aim to sequence all eukaryotes on Earth, but they also allow many laboratories to sequence their species of interest. Although there is a promise to obtain 'perfect genomes' with long read technologies, the number of contigs often exceeds the number of chromosomes significantly, and the contigs contain many insertion and deletion errors around homopolymer tracks. To overcome these issues, we implemented ILRA to correct long reads-based assemblies, a pipeline that orders, names, merges, and circularizes contigs, filters erroneous small contigs and contamination, and corrects homopolymer errors with Illumina reads. We successfully tested our approach to assemble the genomes of four novel Plasmodium falciparum samples, and on existing assemblies of Trypanosoma brucei and Leptosphaeria spp. We found that correcting homopolymer tracks reduced the number of genes incorrectly annotated as pseudogenes, but an iterative correction seems to be needed to reduce high numbers of homopolymer errors. In summary, we described and compared the performance of a new tool, which improves the quality of long read assemblies. It can be used to correct genomes of a size of up to 300 Mb.

2 citations

Journal ArticleDOI
26 Mar 2023-Genes
TL;DR: An integrative literature review was carried out, searching articles in several sites, including PUBMED, NCBI-PMC, and Google Academic, published in English, indexed in referenced databases and without a publication time filter, but prioritizing articles from the last 3 years.
Abstract: Precision and organization govern the cell cycle, ensuring normal proliferation. However, some cells may undergo abnormal cell divisions (neosis) or variations of mitotic cycles (endopolyploidy). Consequently, the formation of polyploid giant cancer cells (PGCCs), critical for tumor survival, resistance, and immortalization, can occur. Newly formed cells end up accessing numerous multicellular and unicellular programs that enable metastasis, drug resistance, tumor recurrence, and self-renewal or diverse clone formation. An integrative literature review was carried out, searching articles in several sites, including PUBMED, NCBI-PMC, and Google Academic, published in English, indexed in referenced databases and without a publication time filter, but prioritizing articles from the last 3 years, to answer the following questions: (i) “What is the current knowledge about polyploidy in tumors?”; (ii) “What are the applications of computational studies for the understanding of cancer polyploidy?”; and (iii) “How do PGCCs contribute to tumorigenesis?”

1 citation

Posted ContentDOI
21 Feb 2023-bioRxiv
TL;DR: In this article, the authors propose a method to identify tyrosine nitration (NT) modification by extracting comprehensive features from raw protein sequences using four different sequence encoders.
Abstract: Post-translational modifications (PTMs) either enhance a protein’s activity in various sub-cellular processes, or degrade its activity, which leads to failure of intracellular processes. Tyrosine nitration (NT) modification degrades a protein’s activity, which initiates and propagates various diseases including neurodegenerative, cardiovascular and autoimmune diseases, and carcinogenesis. Identification of NT modification supports the development of novel therapies and drug discovery for associated diseases. Identification of NT modification in biochemical labs is expensive, time-consuming, and error-prone. To supplement this process, several computational approaches have been proposed. However, these approaches still fail to precisely identify NT modification, due to the extraction of irrelevant, redundant and less discriminative features from protein sequences. This paper presents the NTpred framework, which extracts comprehensive features from raw protein sequences using four different sequence encoders. To reap the benefits of different encoders, it generates four additional feature spaces by fusing different combinations of individual encodings. Furthermore, it eliminates irrelevant and redundant features from the eight different feature spaces through a Recursive Feature Elimination process. Selected features of the four individual encodings and four feature fusion vectors are used to train eight different Gradient Boosted Tree classifiers. The probability scores from the trained classifiers are used to generate a new probabilistic feature space, which is used to train a Logistic Regression classifier. On the BD1 benchmark dataset, the proposed framework outperforms the existing best-performing predictor in 5-fold cross-validation and independent test evaluation, with a combined improvement of 13.7% in MCC and 20.1% in AUC. Similarly, on the BD2 benchmark dataset, the proposed framework outperforms the existing best-performing predictor with a combined improvement of 5.3% in MCC and 1.0% in AUC.
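The MCC figures quoted above refer to the Matthews correlation coefficient, a standard measure for binary classifiers that is computed from the four confusion-matrix counts. A minimal sketch of the standard formula follows; the function name `mcc`, the counts in the usage line, and the convention of returning 0.0 when the denominator is zero are illustrative assumptions, not values or code from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]: 1 for perfect prediction, 0 for chance-level
    prediction, -1 for total disagreement. When any marginal total is zero
    the coefficient is undefined; returning 0.0 in that case is a common
    convention and an assumption of this sketch.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, for illustration only.
score = mcc(tp=40, tn=45, fp=5, fn=10)
```

Unlike plain accuracy, MCC uses all four counts, so it stays informative when positive and negative classes are imbalanced.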