
Showing papers in "Journal of Bioinformatics and Computational Biology in 2022"


Journal ArticleDOI
TL;DR: Zhang et al. selected five superior feature encodings: CKSAAP, ACF, BLOSUM62, AAindex, and one-hot, according to their performance on the problem of succinylation site prediction.
Abstract: The succinylation modification of proteins participates in the regulation of a variety of cellular processes. Identification of modified substrates with precise sites is the basis for understanding the molecular mechanism and regulation of succinylation. In this work, we selected five superior feature encodings: CKSAAP, ACF, BLOSUM62, AAindex, and one-hot, according to their performance on the problem of succinylation site prediction. Then, an LSTM network and a CNN were used to construct four models: LSTM-CNN, CNN-LSTM, LSTM, and CNN. Each of the five selected features was input into each of the four models for training, so that the models could be compared. Based on the performance of each model, the optimal model was chosen to construct a hybrid model, DeepSucc, composed of five sub-modules for integrating heterogeneous information. Under 10-fold cross-validation, the hybrid model DeepSucc achieves 86.26% accuracy, 84.94% specificity, 87.57% sensitivity, 0.9406 AUC, and 0.7254 MCC. When compared with other prediction tools on an independent test set, DeepSucc outperformed them in sensitivity and MCC. The datasets and source codes can be accessed at https://github.com/1835174863zd/DeepSucc.
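For reference, the accuracy, sensitivity, specificity, and MCC figures quoted in abstracts like this one all derive from the same confusion counts. A minimal Python sketch, using hypothetical fold counts rather than the paper's data:

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)          # recall on positive (modified) sites
    spe = tn / (tn + fp)          # recall on negative (unmodified) sites
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sen, spe, mcc

# Hypothetical counts for one cross-validation fold (not the paper's data)
acc, sen, spe, mcc = binary_metrics(tp=438, tn=425, fp=75, fn=62)
```

MCC is the headline number for imbalanced site-prediction tasks because, unlike accuracy, it only approaches 1 when both classes are predicted well.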

4 citations


Journal ArticleDOI
TL;DR: In this article, a rational design of type-3 peptidic antagonists that can directly disrupt the PPAR-coactivator interaction by physically competing with coactivator proteins for the CIS site was described.
Abstract: The peroxisome proliferator-activated receptor-[Formula: see text] (PPAR[Formula: see text]) is a member of the PPAR nuclear receptor family, and its antagonists have been widely used to treat pediatric metabolic disorders. Traditional type-1 and type-2 PPAR[Formula: see text] antagonists are all small-molecule compounds that have been developed to target the ligand-binding site (LBS) of PPAR[Formula: see text], which does not overlap with the coactivator-interacting site (CIS) of PPAR[Formula: see text]. In this study, we described the rational design of type-3 peptidic antagonists that can directly disrupt the PPAR[Formula: see text]-coactivator interaction by physically competing with coactivator proteins for the CIS site. In the procedure, seven reported PPAR[Formula: see text] coactivator proteins were collected and eight 11-mer helical peptide segments that contain the core PPAR[Formula: see text]-binding LXXLL motif were identified in these coactivators. These segments, however, possess a large flexibility and intrinsic disorder when split from their coactivator protein context, and thus would incur a considerable entropy penalty (i.e. indirect readout) upon binding to the PPAR[Formula: see text] CIS site. By carefully examining the natively folded conformation of these helical peptides in their parent protein context and in their interaction mode with the CIS site, we rationally designed a hydrocarbon bridge across the solvent-exposed ([Formula: see text], [Formula: see text] + 4) residues to constrain their helical conformation, thus largely minimizing the unfavorable indirect readout effect while having only a moderate influence on the favorable enthalpy contribution (i.e. direct readout) upon PPAR[Formula: see text]-peptide binding. The computational findings were further substantiated by fluorescence competition assays.

2 citations


Journal ArticleDOI
TL;DR: If the SSA-generated dataset comprises runs with differing simulation times, "TemporalGSSA" can compute the time-dependent trajectory of a metabolite provided the trials-per-technical-replicate constraint is met.
Abstract: Whilst data on biochemical networks have increased several-fold, our comprehension of the underlying molecular biology remains incomplete and inadequate. Simulation studies permit data collation from disparate time points, and the imputed trajectories can provide valuable insights into the molecular biology of complex biochemical systems. Although stochastic simulations are accurate, each run is an independent event, and the generated data cannot be directly compared even for identical simulation times. This lack of robustness can preclude a biologically meaningful result for the metabolite(s) of concern and is a significant limitation of the approach. "TemporalGSSA", or temporal Gillespie Stochastic Simulation Algorithm, is an R wrapper that collates and partitions SSA-generated datasets with identical simulation times (trials) into finite sets of linear models (technical replicates). Each such model (time step of a single run, absolute number of molecules for a metabolite) computes several coefficients (slope, intercept, etc.). These coefficients are averaged (mean slope, mean intercept) across all trials of a technical replicate and, along with an imputed time step (mean, median, random), are incorporated into a linear regression equation. The solution to this equation is the number of molecules of a metabolite, which is used to compute the molar concentration of the metabolite per technical replicate. The summarized (mean, standard deviation) data of this vector of technical replicates is the outcome, a numerical estimate of the molar concentration of a metabolite, and is dependent on the duration of the simulation. If the SSA-generated dataset comprises runs with differing simulation times, "TemporalGSSA" can compute the time-dependent trajectory of a metabolite provided the trials-per-technical-replicate constraint is met.
The algorithms deployed by "TemporalGSSA" are rigorous, have a sound theoretical basis and have contributed meaningfully to our comprehension of the mechanism(s) that drive complex biochemical systems. "TemporalGSSA", is robust, freely accessible and easy to use with several readily testable examples.
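The per-trial regression-and-averaging scheme described above can be sketched as follows. This is an illustrative Python stand-in: the trajectory is a noisy linear series rather than a real SSA run, and the cell volume `V` and imputed time step are assumed values.

```python
import random

def ols(xs, ys):
    """Closed-form simple linear regression; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

NA, V = 6.022e23, 1e-15          # Avogadro's number; assumed 1 fL volume

random.seed(1)
trials = []
for _ in range(10):              # ten runs with identical simulation times
    ts = [i * 0.1 for i in range(101)]
    ys = [50 + 20 * t + random.gauss(0, 3) for t in ts]  # stand-in trajectory
    trials.append(ols(ts, ys))   # per-trial coefficients (slope, intercept)

# Average the coefficients across the technical replicate, then solve the
# regression equation at an imputed time step
mean_slope = sum(s for s, _ in trials) / len(trials)
mean_intercept = sum(c for _, c in trials) / len(trials)
t_imputed = 5.0                  # e.g. the mean time step of the replicate
n_molecules = mean_slope * t_imputed + mean_intercept
concentration_M = n_molecules / (NA * V)   # molar concentration estimate
```

Averaging the fitted coefficients before solving, rather than averaging raw trajectories, is what makes runs comparable despite the independence of individual SSA realizations.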

2 citations


Journal ArticleDOI
TL;DR: In this paper, a set of human aldose reductase inhibitors was selected for molecular dynamics (MD)-based binding free energy calculations; a total of 100 ns of MD simulation was conducted for the ligand-receptor complexes, followed by prediction of binding free energies using MM/PB(GB)SA and LIE approaches at different simulation times.
Abstract: The profound impact of in silico studies on a fast-paced drug discovery pipeline is undeniable for the pharmaceutical community. The rational design of novel drug candidates necessitates optimizing their different aspects prior to synthesis and biological evaluation. Predicting the affinity of small ligands for a target of interest, in order to rank-order potential ligands, is one of the most routinely used steps in virtual screening. Here, end-point methods were employed for binding free energy estimation, focusing on evaluating the effect of simulation time. A set of human aldose reductase inhibitors was selected for molecular dynamics (MD)-based binding free energy calculations. A total of 100 ns of MD simulation was conducted for the ligand-receptor complexes, followed by prediction of binding free energies using MM/PB(GB)SA and LIE approaches at different simulation times. The results revealed that a maximum of 30 ns of simulation time is sufficient for determining binding affinities, as inferred from the steady trend of squared correlation values (R^2) between experimental and predicted ΔG as a function of MD simulation time. In conclusion, the MM/PB(GB)SA algorithms performed well in terms of binding affinity prediction compared to the LIE approach. The results provide new insights for large-scale application of such predictions at an affordable computational cost.
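The convergence criterion above, a plateau in R^2 between experimental and predicted ΔG as simulation time grows, can be illustrated with a short sketch. All ΔG values here are invented for illustration; they are not the paper's data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

# Hypothetical experimental dG (kcal/mol) and end-point predictions taken
# from trajectory windows of increasing length (10, 30, 100 ns)
dg_exp   = [-9.1, -8.4, -7.9, -7.2, -6.5]
dg_10ns  = [-40.2, -38.1, -35.0, -36.9, -30.1]
dg_30ns  = [-41.0, -38.8, -36.5, -34.2, -31.0]
dg_100ns = [-41.2, -38.9, -36.8, -34.0, -31.2]

trend = [r_squared(dg_exp, pred) for pred in (dg_10ns, dg_30ns, dg_100ns)]
```

A steady R^2 beyond some window (here, essentially unchanged from 30 ns to 100 ns) is what justifies truncating the simulation: end-point rankings matter, not absolute ΔG values.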

1 citation


Journal ArticleDOI
TL;DR: This work demonstrates the concept of biomarker pairs, which may be considered for clinical validation due to their strong literature and experimental support.
Abstract: Tetralogy of Fallot (TOF) is a cyanotic congenital condition contributed to by genetic, epigenetic, and environmental factors. We applied sparse machine learning algorithms to RNAseq and sRNAseq data to select prospective biomarker candidates, and then applied filtering techniques to identify a subset of biomarker pairs in TOF. Differential expression analysis disclosed 2757 genes and 214 miRNAs that are dysregulated. Weighted gene co-expression network analysis on the differentially expressed genes extracted five significant modules enriched in GO terms for extracellular matrix, signaling, and calcium ion binding. voomNSC selected two genes and five miRNAs, and transformed PLDA predicted 72 genes and 38 miRNAs, as prognostic biomarkers. Among the selected biomarkers, miRNA target analysis revealed 14 miRNA-gene interactions; 10 of the 14 pairs were oppositely expressed, and four of the 10 oppositely expressed biomarker pairs shared the common pathways of focal adhesion and PI3K-Akt signaling. In conclusion, our study demonstrates the concept of biomarker pairs, which may be considered for clinical validation due to their strong literature and experimental support.
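The opposite-expression filter for miRNA-gene biomarker pairs can be sketched as follows. All identifiers and fold-changes here are hypothetical, not the paper's lists.

```python
# Hypothetical log2 fold-changes for differentially expressed molecules
mirna_lfc = {"miR-1": 2.1, "miR-2": -1.8, "miR-3": 1.5}
gene_lfc  = {"GENE-A": -1.9, "GENE-B": 2.2, "GENE-C": 0.9}

# Hypothetical miRNA-target interactions from a target-prediction step
targets = [("miR-1", "GENE-A"), ("miR-2", "GENE-B"), ("miR-3", "GENE-C")]

# Keep pairs whose fold-changes have opposite signs, consistent with
# miRNA-mediated repression of the target gene
opposite = [(m, g) for m, g in targets if mirna_lfc[m] * gene_lfc[g] < 0]
```

The sign product is a compact test: a negative product means one member is up-regulated while the other is down-regulated, which is the expression pattern expected of a functional miRNA-target biomarker pair.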

1 citation


Journal ArticleDOI
TL;DR: In this article, the authors list the most common methods used to handle patient treatments, discuss caveats associated with each method, and note that the best approach to properly handling differences in patient treatment is specific to each individual situation.
Abstract: Clinical prediction models are widely used to predict adverse outcomes in patients, and are often employed to guide clinical decision-making. Clinical data typically consist of patients who received different treatments. Many prediction modeling studies fail to account for differences in patient treatment appropriately, which results in the development of prediction models that show poor accuracy and generalizability. In this paper, we list the most common methods used to handle patient treatments and discuss certain caveats associated with each method. We believe that proper handling of differences in patient treatment is crucial for the development of accurate and generalizable models. As different treatment strategies are employed for different diseases, the best approach to properly handling differences in patient treatment is specific to each individual situation. We use the Ma-Spore acute lymphoblastic leukemia dataset as a case study to demonstrate the complexities associated with differences in patient treatment, and offer suggestions on incorporating treatment information during evaluation of prediction models. In clinical data, patients are typically treated on a case-by-case basis, with unique cases occurring more frequently than expected. Hence, there are many subtleties to consider during the analysis and evaluation of clinical prediction models.

1 citation


Journal ArticleDOI
TL;DR: A unique and next-generation database for bHLH transcription factors was created and made available to the scientific community; the authors believe the database will be a valuable tool in future studies of the bHLH family.
Abstract: The basic helix-loop-helix (bHLH) superfamily is a large and diverse protein family that plays a role in various vital functions in nearly all animals and plants. The bHLH proteins form one of the largest families of transcription factors found in plants, acting as homo- or heterodimers to regulate the expression of their target genes. The bHLH transcription factors are involved in many aspects of plant development and metabolism, including photomorphogenesis, light signal transduction, secondary metabolism, and stress response. The amount of molecular data has increased dramatically with the development of high-throughput techniques and the wide use of bioinformatics. The most efficient way to use this information is to store and analyze the data in a well-organized manner. In this study, all members of the bHLH superfamily in the plant kingdom were used to develop and implement a relational database. We have created a database called bHLHDB (www.bhlhdb.org) for the bHLH family members, on which queries can be conducted based on family or sequence information. The Hidden Markov Model (HMM), which is frequently used by researchers for sequence analysis, and BLAST queries were integrated into the database. In addition, a deep learning model was developed to predict the type of TF quickly and efficiently from the protein sequence alone, with 97.54% accuracy and 97.76% precision. We created a unique and next-generation database for bHLH transcription factors and made this database available to the world of science. We believe that the database will be a valuable tool in future studies of the bHLH family.

1 citation


Journal ArticleDOI
TL;DR: In this article, a two-layer prediction model with an enhanced feature extraction strategy was proposed, which combines features from an improved position-specific amino acid propensity (PSTKNC) method with enhanced nucleic acid composition (ENAC) and composition of k-spaced nucleic acid pairs (CKSNAP), and sends them through a simple ANN to accurately identify enhancers in the first layer and their strength in the second layer.
Abstract: Enhancers are short regulatory DNA fragments that are bound by proteins called activators. They are free-bound and distant elements, which play a vital role in controlling gene expression. It is challenging to identify enhancers and their strength due to their dynamic nature. Although some machine learning methods exist to accelerate the identification process, their prediction accuracy and efficiency need further improvement. In this regard, we propose a two-layer prediction model with an enhanced feature extraction strategy that combines features from an improved position-specific amino acid propensity (PSTKNC) method with Enhanced Nucleic Acid Composition (ENAC) and Composition of k-spaced Nucleic Acid Pairs (CKSNAP). The feature sets from all three feature extraction approaches were concatenated and then sent through a simple artificial neural network (ANN) to accurately identify enhancers in the first layer and their strength in the second layer. Experiments were conducted on a benchmark dataset of nine chromatin cell lines. A 10-fold cross-validation method was employed to evaluate the model's performance. The results show that the proposed model gives an outstanding performance, with an accuracy of 94.50% and a Matthews correlation coefficient (MCC) of 0.8903 in predicting enhancers, and also does fairly well on the independent test when compared with all other existing methods.
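Of the three encodings, CKSNAP has a simple closed form: for each gap size k, count the frequency of the 16 ordered base pairs separated by k intervening positions. A sketch, with arbitrary gap sizes and an arbitrary example sequence:

```python
from itertools import product

def cksnap(seq, k):
    """Composition of k-spaced nucleic acid pairs: the frequency of each
    of the 16 ordered base pairs (a, b) with b located k+1 positions
    after a in the sequence."""
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    n_pairs = len(seq) - k - 1
    for i in range(n_pairs):
        counts[seq[i] + seq[i + k + 1]] += 1
    return [counts[p] / n_pairs for p in pairs]

# Concatenating several gap sizes yields the CKSNAP feature block; the
# paper then fuses this with ENAC and PSTKNC features before the ANN.
features = cksnap("ACGTACGTAC", 0) + cksnap("ACGTACGTAC", 1)
```

Each gap size contributes a 16-dimensional frequency vector that sums to 1, so the concatenated block stays small enough to feed a simple ANN directly.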

1 citation


Journal ArticleDOI
TL;DR: A new method is presented for predicting the number of clusters in a Robinson and Foulds (RF) distance matrix using a convolutional neural network (CNN) and a new CNN approach (called CNNTrees) is developed for multiple tree classification.
Abstract: The evolutionary histories of genes can differ greatly from one another, which may be explained by horizontal gene transfers or biological recombination events. A phylogenetic tree therefore represents the evolutionary history of each gene, which may exhibit different patterns from the species tree that defines the main evolutionary patterns. Phylogenetic trees of closely related species can be merged, minimizing the topological conflicts they present, to obtain consensus trees (in the case of homogeneous data) or supertrees (in the case of heterogeneous data). The traditional approaches are consensus tree inference (if the set of trees contains the same set of species) or supertree inference (if the trees contain different but overlapping sets of species). Consensus trees and supertrees are constructed to produce unique trees; however, these methods lose precision with respect to evolutionary variability. Other approaches have been implemented to preserve this variability using the K-means or K-medoids algorithms. Using a new method, we determine all possible consensus trees and supertrees that best represent the most significant evolutionary models in a set of phylogenetic trees, thereby increasing the precision of the results and decreasing the time required. Results: This paper presents in detail a new method for predicting the number of clusters in a Robinson and Foulds (RF) distance matrix using a convolutional neural network (CNN). We developed a new CNN approach (called CNNTrees) for multiple tree classification. This new strategy returns a number of clusters of the input phylogenetic trees for different-size sets of trees, which makes the new approach more stable and more robust.
The paper provides an in-depth analysis of the relevant, but very difficult, problem of constructing alternative supertrees using phylogenies with different but overlapping sets of taxa. This new model will play an important role in the inference of Trees of Life (ToL). Availability and implementation: CNNTrees is available through a web server at https://tahirinadia.github.io/. The source code, data and information about installation procedures are also available at https://github.com/TahiriNadia/CNNTrees. Supplementary information: Supplementary data are available on the GitHub platform.
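The RF distance underlying the input matrix counts the groupings by which two trees differ. A minimal sketch for rooted trees given as nested tuples; this is a simplified rooted variant for illustration, not the exact implementation used by CNNTrees:

```python
def _collect(tree, out):
    """Return the leaf set of `tree`, recording every internal clade in `out`."""
    if not isinstance(tree, tuple):              # a leaf label
        return frozenset([tree])
    leaves = frozenset().union(*(_collect(child, out) for child in tree))
    out.add(leaves)
    return leaves

def rf_distance(t1, t2):
    """Robinson-Foulds distance for rooted trees on the same leaf set:
    the number of clades present in one tree but not the other."""
    c1, c2 = set(), set()
    _collect(t1, c1)
    _collect(t2, c2)
    return len(c1 ^ c2)                          # symmetric difference

# Two gene trees that disagree only on the placement of B and C
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
d = rf_distance(t1, t2)
```

Computing this distance for every pair of gene trees yields the RF matrix that the CNN then inspects to predict the number of clusters.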

1 citation


Journal ArticleDOI
TL;DR: In this article, the authors evaluate both alignment-dependent and alignment-independent methods to identify optimal strategies for analyzing mixed-species RNA sequencing data, and find that the more traditional approach of mixed-genome alignment followed by optimized separation of reads is the more successful, with lower error rates.
Abstract: Gene expression studies using xenograft transplants or co-culture systems, usually with mixed human and mouse cells, have proven valuable for uncovering cellular dynamics during development or in disease models. However, the mRNA sequence similarities among species present a challenge for accurate transcript quantification. To identify optimal strategies for analyzing mixed-species RNA sequencing data, we evaluated both alignment-dependent and alignment-independent methods. Alignment of reads to a pooled reference index is effective, particularly if optimal alignments are used to classify sequencing reads by species before they are re-aligned with individual genomes, generating [Formula: see text] accuracy across a range of species ratios. Alignment-independent methods, such as convolutional neural networks that extract the conserved sequence patterns of the two species, classify RNA sequencing reads with over 85% accuracy. Importantly, both methods perform well with different ratios of human and mouse reads. While non-alignment strategies successfully partitioned reads by species, the more traditional approach of mixed-genome alignment followed by optimized separation of reads proved the more successful, with lower error rates.
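The read-classification step can be approximated with a toy shared-k-mer score. Real pipelines use full aligners against indexed genomes; the reference snippets and read below are invented for illustration.

```python
def kmers(seq, k=8):
    """The set of all k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, human_index, mouse_index, k=8):
    """Assign a read to the species whose reference shares more k-mers;
    ties are labeled ambiguous for re-alignment against both genomes."""
    h = sum(km in human_index for km in kmers(read, k))
    m = sum(km in mouse_index for km in kmers(read, k))
    if h == m:
        return "ambiguous"
    return "human" if h > m else "mouse"

# Toy stand-in references (real pipelines index whole transcriptomes)
human_ref = "ACGTACGTTTGGCCAACGTGACGT"
mouse_ref = "ACGTACGTTTGGCCTTCGTGAGGT"
hi, mi = kmers(human_ref), kmers(mouse_ref)
label = classify_read("GGCCAACGTGAC", hi, mi)
```

The hard cases are exactly the ones this toy flags as ambiguous: reads from regions conserved between the species, which is why re-alignment to individual genomes after the initial split lowers the error rate.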

1 citation


Journal ArticleDOI
Xiuquan Du1
TL;DR: DeepBtoD, as described in this paper, proposes a multi-scale convolutional module embedded with a self-attention mechanism, which is used to further learn more effective, diverse, and discriminative high-level features.
Abstract: RNA-binding proteins (RBPs) play crucial roles in various cellular processes such as alternative splicing and gene regulation. Therefore, the analysis and identification of RBPs is an essential issue. However, although many computational methods have been developed for predicting RBPs, few studies simultaneously consider local and global information from the perspective of the RNA sequence. Facing this challenge, we present a novel method called DeepBtoD, which predicts RBPs directly from RNA sequences. First, a [Formula: see text]-BtoD encoding is designed, which takes into account the composition of [Formula: see text]-nucleotides and their relative positions, and forms a local module. Second, we designed a multi-scale convolutional module embedded with a self-attention mechanism, the ms-focusCNN, which is used to further learn more effective, diverse, and discriminative high-level features. Finally, global information is considered to supplement the local modules with ensemble learning to predict whether the target RNA binds to RBPs. Preliminary results on 24 independent test datasets show that our proposed method can classify RBPs with an area under the curve of 0.933. Remarkably, DeepBtoD shows competitive results against seven state-of-the-art methods, suggesting that RBPs can be highly recognized by integrating local [Formula: see text]-BtoD and global information from RNA sequences alone. Hence, our integrative method may be useful for improving the power of RBP prediction, which might be particularly useful for modeling protein-nucleic acid interactions in systems biology studies. Our DeepBtoD server can be accessed at http://175.27.228.227/DeepBtoD/.

Journal ArticleDOI
TL;DR: RPfam as discussed by the authors refines the automatic alignment via scoring alignments based on the PFASUM matrix, restricting realignments within badly aligned blocks, optimizing the block scores by dynamic programming, and running refinements iteratively using the Simulated Annealing algorithm.
Abstract: High-quality multiple sequence alignments can provide insights into the architecture and function of protein families. Existing MSA tools often generate results inconsistent with the biological distribution of conserved regions because they position amino acid residues and gaps by symbols only. We propose RPfam, a refiner towards curated-like MSAs for modeling the protein families in the Pfam database. RPfam refines automatic alignments by scoring alignments based on the PFASUM matrix, restricting realignments to badly aligned blocks, optimizing the block scores by dynamic programming, and running refinements iteratively using the Simulated Annealing algorithm. Experiments show that RPfam effectively refined the alignments produced by the MSA tools ClustalO and Muscle with reference to the curated seed alignments of the Pfam protein families. In particular, RPfam improved the quality of the ClustalO alignments by 4.4% and the Muscle alignments by 2.8% on the gp32 DNA binding protein-like family. The Supplementary Table is available at http://www.worldscinet.com/jbcb/ .

Journal ArticleDOI
TL;DR: In this paper , the authors employed a network expansion algorithm to evolve the metabolic network of the peri-implantation embryo metabolism and utilized flux balance analysis (FBA) to examine the viability of the evolved networks.
Abstract: Metabolism is an essential cellular process for the growth and maintenance of organisms. A better understanding of metabolism during embryogenesis may shed light on the developmental origins of human disease. Metabolic networks, however, are vastly complex, with many redundant pathways and interconnected circuits. Thus, computational approaches serve as a practical solution for unraveling the genetic basis of embryo metabolism to help guide future experimental investigations. RNA-sequencing and other profiling technologies make it possible to elucidate metabolic genotype-phenotype relationships, and yet our understanding of metabolism is limited. Very few studies have examined the temporal or spatial metabolomics of the human embryo, and the prohibitively small sample sizes traditionally observed in human embryo research have presented logistical challenges for metabolic studies, hindering progress towards the reconstruction of the human embryonic metabolome. We employed a network expansion algorithm to evolve the metabolic network of the peri-implantation embryo, and we utilized flux balance analysis (FBA) to examine the viability of the evolved networks. We found that modulating oxygen uptake promotes lactate diffusion across the outer mitochondrial layer, providing in silico support for a proposed lactate-malate-aspartate shuttle. We developed a stage-specific model to serve as a proof of concept for the reconstruction of future metabolic models of development. Our work shows that it is feasible to model human metabolism with respect to the time-dependent changes characteristic of peri-implantation development.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a two-stage denoising method that is based on the SEM noise model and a traditional variance stabilization strategy, and combines an attention mechanism to achieve efficient noise removal.
Abstract: Scanning electron microscopy (SEM) is of great significance for analyzing ultrastructure. However, due to the data-throughput and electron-dose requirements for biological samples during imaging, SEM images of biological samples are often contaminated by noise that severely affects the observation of ultrastructure. It is therefore necessary to analyze and establish a noise model of SEM and to propose an effective denoising algorithm that preserves the ultrastructure. We first investigated the noise sources of SEM images and introduced a signal-related SEM noise model. We then validated the effectiveness of the noise model through experiments designed with standard samples to reflect the relation between real signal intensity and noise. Based on the SEM noise model and the traditional variance stabilization denoising strategy, we propose a novel two-stage denoising method. In the first stage (variance stabilization), our VS-Net separates signal-dependent noise from signal in the SEM image. In the second stage (denoising), our D-Net employs the structure of U-Net combined with an attention mechanism to achieve efficient noise removal. Compared with other existing denoising methods for SEM images, our proposed method is more competitive in objective evaluation and visual effects. Source code is available on GitHub (https://github.com/VictorCSheng/VSID-Net).
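The variance-stabilization idea behind the first stage can be illustrated with the classic Anscombe transform for Poisson noise. This is an illustrative stand-in: the paper's VS-Net learns the stabilization, and its noise model is signal-related rather than purely Poisson.

```python
import math, random

def anscombe(x):
    """Anscombe variance-stabilizing transform for Poisson counts: after
    the transform, noise variance is ~1 regardless of signal intensity."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)

def poisson(lam, rng):
    """Knuth's Poisson sampler (adequate for the small rates used here)."""
    target, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= target:
            return k
        k += 1

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)
raw_vars, vst_vars = [], []
for lam in (20, 80, 320):                       # three signal intensities
    counts = [poisson(lam, rng) for _ in range(2000)]
    raw_vars.append(variance(counts))           # grows with the signal
    vst_vars.append(variance([anscombe(c) for c in counts]))  # ~1 everywhere
```

Once the noise level no longer depends on the signal, a single denoiser can treat bright and dark regions uniformly, which is exactly what the second-stage network exploits.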

Journal ArticleDOI
Author Index, Volume 20 (2022). Journal of Bioinformatics and Computational Biology, Vol. 20, No. 06, 2299001 (2022). Free access. https://doi.org/10.1142/S0219720022990013

Journal ArticleDOI
TL;DR: The proposed feature engineering algorithm SiaCo was comprehensively evaluated using both transcriptome and methylome datasets; SiaCo features improved classification accuracies for binary classification problems and achieved improvements on the independent test dataset.
Abstract: Modern biotechnologies have generated huge amounts of OMIC data, among which transcriptomes and methylomes are two major types. Transcriptomes measure the expression levels of all transcripts, while methylomes depict the cytosine methylation levels across a genome. Both OMIC data types can be generated by array or sequencing technologies. Some studies deliver many more features (the number of features is denoted as p) for a sample than the number n of samples in a cohort, which induces the "large p, small n" paradigm. This study focused on the classification problem for OMIC data under the "large p, small n" paradigm. A Siamese convolutional network was utilized to transform the OMIC features into a new space with minimized intra-class distances and maximized inter-class distances between samples. The proposed feature engineering algorithm SiaCo was comprehensively evaluated using both transcriptome and methylome datasets. The experimental data showed that SiaCo generated features with improved classification accuracies for binary classification problems and achieved improvements on the independent test dataset. The individual SiaCo features did not show better inter-class discrimination power than the original OMIC features. This may be because the Siamese convolutional network optimized the collective performance of the SiaCo features rather than each individual feature's discrimination power. The inherent transformation nature of the Siamese twin network also makes the SiaCo features lack interpretability. The source code of SiaCo is freely available at http://www.healthinformaticslab.org/supp/resources.php.
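The intra-class/inter-class distance objective of a Siamese network is typically trained with a contrastive loss, sketched below. The embeddings and margin here are hypothetical; the abstract does not specify SiaCo's exact loss function.

```python
import math

def contrastive_loss(za, zb, same_class, margin=2.0):
    """Contrastive loss on a pair of embeddings: pull same-class pairs
    together, push different-class pairs at least `margin` apart."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(za, zb)))
    if same_class:
        return d ** 2                       # penalize any separation
    return max(0.0, margin - d) ** 2        # penalize closeness only

# Hypothetical 3D embeddings of three samples (two of one class, one other)
z1, z2, z3 = [0.1, 0.2, 0.0], [0.2, 0.1, 0.1], [1.9, 1.8, 2.0]

pull = contrastive_loss(z1, z2, same_class=True)    # small: already close
push = contrastive_loss(z1, z3, same_class=False)   # zero: already far apart
```

Because the loss is defined on pairs rather than single features, the learned coordinates are only meaningful collectively, which matches the paper's observation that individual SiaCo features neither discriminate better on their own nor remain interpretable.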

Journal ArticleDOI
TL;DR: In this paper , the authors proposed Feedback-AVPGAN, a system that aims to computationally generate novel antiviral peptides (AVPs) using the key premise of the Generative Adversarial Network (GAN) model and the Feedback method.
Abstract: In this study, we propose Feedback-AVPGAN, a system that aims to computationally generate novel antiviral peptides (AVPs). This system relies on the key premises of the Generative Adversarial Network (GAN) model and the Feedback method. A GAN, a generative modeling approach that uses deep learning, comprises a generator and a discriminator. The generator is used to generate peptides; the generated peptides are fed to the discriminator to distinguish between AVPs and non-AVPs. The original GAN design uses actual data to train the discriminator. However, not many AVPs have been experimentally obtained. To solve this problem, we used the Feedback method to allow the discriminator to learn from the existing as well as the generated synthetic data. We implemented this method using a classifier module that classifies each peptide sequence generated by the GAN generator as AVP or non-AVP. The classifier uses the transformer network and achieves high classification accuracy. This mechanism enables the efficient generation of peptides with a high probability of exhibiting antiviral activity. Using the Feedback method, we evaluated various algorithms and their performance. Moreover, we modeled the structure of the generated peptides using AlphaFold2 and identified peptides having physicochemical properties and structures similar to those of known AVPs, although with different sequences.

Journal ArticleDOI
TL;DR: In this paper , a novel network construction algorithm for identifying early warning network signals (IEWNS) is proposed for improving the performance of lung adenocarcinoma (LUAD) early diagnosis.
Abstract: Lung adenocarcinoma (LUAD) seriously threatens human health and generally results from dysfunction of relevant module molecules, which dynamically change with time and conditions, rather than that of an individual molecule. In this study, a novel network construction algorithm for identifying early warning network signals (IEWNS) is proposed for improving the performance of LUAD early diagnosis. To this end, we theoretically derived a dynamic criterion, namely, the relationship of variation (RV), to construct dynamic networks. RV infers correlation [Formula: see text] statistics to measure dynamic changes in molecular relationships during the process of disease development. Based on the dynamic networks constructed by IEWNS, network warning signals used to represent the occurrence of LUAD deterioration can be defined without human intervention. IEWNS was employed to perform a comprehensive analysis of gene expression profiles of LUAD from The Cancer Genome Atlas (TCGA) database and the Gene Expression Omnibus (GEO) database. The experimental results suggest that the potential biomarkers selected by IEWNS can facilitate a better understanding of pathogenetic mechanisms and help to achieve effective early diagnosis of LUAD. In conclusion, IEWNS provides novel insight into the initiation and progression of LUAD and helps to define prospective biomarkers for assessing disease deterioration.
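The idea of a dynamic network built from changing molecular relationships can be illustrated with a small numpy sketch. Here a shift in gene-gene Pearson correlation between two stages stands in for the paper's RV statistic, whose exact form is not reproduced; the "genes" and threshold are fabricated for illustration.

```python
import numpy as np

def dynamic_edges(expr_ref, expr_case, threshold=0.5):
    """Flag gene pairs whose Pearson correlation shifts by more than
    `threshold` between two stages (a simple stand-in for the RV
    statistic used in IEWNS)."""
    delta = np.abs(np.corrcoef(expr_case) - np.corrcoef(expr_ref))
    n = expr_ref.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if delta[i, j] > threshold]

# Three toy "genes" over 30 samples; in the disease stage, gene 1
# suddenly tracks gene 0 -- a changed relationship, not a changed
# individual expression level.
t = np.linspace(0, 2 * np.pi, 30)
ref = np.vstack([np.sin(t), np.cos(t), t])
case = ref.copy()
case[1] = ref[0] + 0.01 * np.cos(t)
edges = dynamic_edges(ref, case)
print((0, 1) in edges)  # True: the 0-1 relationship changed
```

Edges that appear only under disease conditions are exactly the kind of module-level signal the abstract describes, as opposed to single-molecule markers.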

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a tensor nuclear norm (TNN) to preserve the heterogeneous structure in the low-rank information, and extended it to the tensor robust principal component analysis model.
Abstract: Tensor Robust Principal Component Analysis (TRPCA) has achieved promising results in the analysis of genomics data. However, the TRPCA model under the existing tensor singular value decomposition (t-SVD) framework insufficiently extracts the potential low-rank structure of the data, resulting in suboptimal restored components. Moreover, the tensor nuclear norm (TNN) defined based on t-SVD applies the same standard to all singular values. TNN ignores the differences among singular values, so the main information that should be well preserved can be lost. To preserve the heterogeneous structure in the low-rank information, we propose a novel TNN and extend it to the TRPCA model. The potential low-rank space may contain important information. We learn the low-rank structural information from the core tensor. The singular value space contains the association information between genes and cancers. The γ-shrinkage generalized threshold function is utilized to preserve the low-rank properties of larger singular values. The optimization problem is solved by the alternating direction method of multipliers (ADMM) algorithm. Clustering and feature selection experiments are performed on the TCGA dataset. The experimental results show that the proposed model is more promising than other state-of-the-art tensor decomposition methods.
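The key intuition of treating singular values unequally can be shown on an ordinary matrix: shrink small singular values (mostly noise) aggressively while nearly preserving large ones (the main low-rank information). The weight function below is an illustrative stand-in for the paper's γ-shrinkage generalized threshold, not its exact definition, and the data are synthetic.

```python
import numpy as np

def weighted_svt(X, tau, gamma=1.0):
    """Singular-value thresholding with value-dependent weights: the
    penalty tau / sigma**gamma is large for small singular values and
    small for large ones, so dominant components are nearly preserved."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    weights = tau / (s + 1e-12) ** gamma      # larger sigma -> smaller penalty
    s_shrunk = np.maximum(s - weights, 0.0)   # small sigmas are zeroed out
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 15))  # rank-3 signal
noisy = low_rank + 0.05 * rng.normal(size=(20, 15))
recovered = weighted_svt(noisy, tau=1.0)
print(recovered.shape)  # (20, 15)
```

With uniform (TNN-style) shrinkage the three signal components would be reduced by the same amount as the noise components; the weighted rule instead concentrates the penalty on the noise part of the spectrum.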

Journal ArticleDOI
TL;DR: In this paper , the authors attempted to ascertain the minimal sequence requirement (MSR) around the central acetyl-lysine residue of SIRT1 substrate recognition sites as well as the amino acid preference (AAP) at different residues of the MSR window through quantitative structure-activity relationship (QSAR) strategy.
Abstract: Sirtuin 1 (SIRT1) is a nicotinamide adenine dinucleotide (NAD+)-dependent deacetylase involved in multiple glucose metabolism pathways and plays an important role in the pathogenesis of diabetes mellitus (DM). The enzyme specifically recognizes its deacetylation substrates' peptide segments, which contain a central acetyl-lysine residue as well as a number of amino acids flanking the central residue. In this study, we attempted to ascertain the minimal sequence requirement (MSR) around the central acetyl-lysine residue of SIRT1 substrate-recognition sites, as well as the amino acid preference (AAP) at different residue positions of the MSR window, through a quantitative structure-activity relationship (QSAR) strategy. This would benefit our understanding of SIRT1 substrate specificity at the molecular level and is also helpful for rationally designing substrate-mimicking peptidic agents against DM that competitively target the SIRT1 active site. In this procedure, a large-scale dataset containing 6801 13-mer acetyl-lysine peptides (and their SIRT1-catalyzed deacetylation activities) was compiled to train 10 QSAR regression models developed by systematic combination of machine learning methods (PLS and SVM) and five amino acid descriptors (DPPS, T-scale, MolSurf, z-score, and FASGAI). The two best QSAR models (PLS+FASGAI and SVM+DPPS) were then employed to statistically examine the contribution of residue positions to the deacetylation activity of acetyl-lysine peptide substrates, revealing that the MSR can be represented by 5-mer acetyl-lysine peptides that meet a consensus motif [Formula: see text][Formula: see text][Formula: see text](AcK)0[Formula: see text].
Structural analysis found that the [Formula: see text] and (AcK)0 residues are tightly packed against the enzyme active site and confer both stability and specificity on the enzyme-substrate complex, whereas the [Formula: see text], [Formula: see text] and [Formula: see text] residues are partially exposed to solvent but can also effectively stabilize the complex system. Subsequently, a systematic deacetylation activity change profile (SDACP) was created based on QSAR modeling, from which the AAP for each residue position of the MSR was depicted. With the profile, we were able to rationally design an SDACP combinatorial library with promising deacetylation activity, from which nine MSR acetyl-lysine peptides as well as two known SIRT1 acetyl-lysine peptide substrates were tested using a SIRT1 deacetylation assay. The designed peptides exhibit comparable or even higher activity than the controls, although the former are considerably shorter than the latter.
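The QSAR setup described above — encode each residue position with a numeric descriptor, then regress activity on the concatenated encoding — can be sketched in a few lines. The descriptor values, peptides, and activities below are fabricated for illustration, and ordinary least squares stands in for the paper's PLS and SVM regressors.

```python
import numpy as np

# Toy one-dimensional residue descriptor (a stand-in for the multi-
# dimensional FASGAI/DPPS scales); the values are illustrative only.
DESC = {"A": 0.31, "K": -0.99, "L": 1.70, "F": 1.79, "G": 0.0, "S": -0.04}

def encode(peptide):
    """Concatenate per-residue descriptor values for a fixed-length peptide."""
    return np.array([DESC[aa] for aa in peptide])

# Tiny synthetic training set in which activity depends mostly on
# positions 0 and 2, mimicking a positional-contribution analysis.
peptides = ["ALKFG", "KLAFS", "GLKFA", "SKAFL", "FAKGS", "LSKGA"]
true_w = np.array([2.0, 0.1, 1.5, 0.1, 0.1])
X = np.vstack([encode(p) for p in peptides])
y = X @ true_w

# Ordinary least squares stands in for the PLS/SVM regressors.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ w, y))  # True: the fit reproduces the toy activities
```

Inspecting the magnitude of the fitted weights per position is the simplest analogue of the positional-contribution analysis used to delimit the MSR window.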

Journal ArticleDOI
Jianli Liu1
TL;DR: In this paper , a long short-term memory network (LSTM) was used to predict the nucleosome dynamic intervals (NDIs) of the 16 yeast chromosomes using time series data (TSD).
Abstract: Nucleosome localization is a dynamic process and consists of nucleosome dynamic intervals (NDIs). We preprocessed nucleosome sequence data as time series data (TSD) and developed a long short-term memory network (LSTM) model trained on the TSD (the LSTM-TSD model) through iterative training and feature learning, which predicts NDIs with high accuracy. The Sn, Sp, Acc, and MCC of the obtained LSTM model are 91.88%, 92.72%, 92.30%, and 84.61%, respectively. The LSTM model could precisely predict the NDIs of the 16 yeast chromosomes. The NDIs contain 90.29% of nucleosome core DNA and 91.20% of nucleosome central sites, indicating that the NDIs have high confidence. We found that the binding sites of transcriptional proteins and other proteins lie outside the NDIs rather than within them. These results are important for the analysis of nucleosome localization and gene transcriptional regulation.
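Turning a per-base signal into time series training data typically means slicing it into overlapping fixed-length windows. The sketch below shows that standard preprocessing step on a toy occupancy track; the paper's exact TSD encoding, window length, and step are not reproduced here.

```python
import numpy as np

def to_time_series_windows(signal, window=5, step=1):
    """Slice a per-base signal into overlapping fixed-length windows,
    the usual input shape for sequence models such as LSTMs."""
    return np.array([signal[i:i + window]
                     for i in range(0, len(signal) - window + 1, step)])

# Toy nucleosome-occupancy values along 7 base positions
occupancy = np.array([0.1, 0.4, 0.9, 0.8, 0.3, 0.2, 0.7])
X = to_time_series_windows(occupancy)
print(X.shape)  # (3, 5): three overlapping windows of length 5
```

Each row then serves as one training sequence, with the corresponding NDI label attached downstream.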

Journal ArticleDOI
TL;DR: In this article , an unsupervised learning method, the GI algorithm, is used to classify the enzymatic activity of the spots in a zymography image, a classification traditionally done by visual analysis, which makes it a subjective process.
Abstract: Gel zymography quantifies the activity of certain enzymes in tumor processes. These enzymes are widely used in medical diagnosis. In order to analyze them, experts classify the zymography spots into various classes according to their tonalities. This classification is done by visual analysis, which makes it a subjective process. This work proposes a methodology to carry out this classification with a process that involves an unsupervised learning algorithm applied to the images, denoted as the GI algorithm. Based on the experiments shown in this paper, this methodology could constitute a tool that bioinformatics scientists can trust to perform the desired classification, since it provides a quantitative indicator for ordering the enzymatic activity of the spots in a zymography.
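The general idea of grouping spots by tonality with unsupervised learning can be illustrated with a minimal 1-D k-means over grayscale intensities. This is only an illustration of the approach; the GI algorithm itself is not reproduced here, and the intensity values are fabricated.

```python
import numpy as np

def kmeans_1d(values, k=2, iters=20):
    """Minimal 1-D k-means: assign each intensity to its nearest center,
    then recompute centers, repeating for a fixed number of iterations."""
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels, centers

# Toy spot intensities on a 0-255 gray scale: two tonality groups
spots = np.array([30.0, 35.0, 40.0, 200.0, 210.0, 220.0])
labels, centers = kmeans_1d(spots)
print(labels)  # [0 0 0 1 1 1]
```

The resulting cluster centers give exactly the kind of quantitative, reproducible ordering of spot tonalities that replaces subjective visual grading.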

Journal ArticleDOI
TL;DR: A novel chromosome segmentation algorithm is proposed to decompose overlapping chromosomes, together with a CNN-based classifier that outperforms all existing classifiers and a novel post-processing algorithm that further improves the classification results.
Abstract: A karyotype is a genetic test that is used for the detection of chromosomal defects. In a karyotype test, an image is captured of the chromosomes during cell division. The captured images are then analyzed by cytogeneticists in order to detect possible chromosomal defects. In this paper, we have proposed an automated pipeline for the analysis of karyotype images. There are three main steps in karyotype image analysis: image enhancement, image segmentation, and chromosome classification. We have proposed a novel chromosome segmentation algorithm to decompose overlapping chromosomes. We have also proposed a CNN-based classifier which outperforms all existing classifiers. Our classifier is trained on a dataset of about 162,000 human chromosome images. We also introduce a novel post-processing algorithm which improves the classification results. The success rate of our segmentation algorithm is 95%. In addition, our experimental results show that the accuracy of our classifier for human chromosomes is 92.63%, and our novel post-processing algorithm increases the classification accuracy to 94%.

Journal ArticleDOI
TL;DR: In this paper , the authors evaluated the altered mRNA expression profiles of 27 RNA modification enzymes and compared the differences in tumor microenvironment (TME) and clinical prognosis between two RNA modification patterns using unsupervised clustering.
Abstract: Background: RNA adenosine modifications are crucial for regulating RNA levels. N6-methyladenosine (m6A), N1-methyladenosine (m1A), adenosine-to-inosine RNA editing, and alternative polyadenylation (APA) are four major RNA modification types. Methods: We evaluated the altered mRNA expression profiles of 27 RNA modification enzymes and compared the differences in tumor microenvironment (TME) and clinical prognosis between two RNA modification patterns using unsupervised clustering. Then, we constructed a scoring system, WM_score, and quantified the RNA modifications in patients of gastric cancer (GC), associating WM_score with TME, clinical outcomes, and effectiveness of targeted therapies. Results: RNA adenosine modifications strongly correlated with TME and could predict the degree of TME cell infiltration, genetic variation, and clinical prognosis. Two modification patterns were identified according to high and low WM_scores. Tumors in the WM_score-high subgroup were closely linked with survival advantage, CD4- T-cell infiltration, high tumor mutation burden, and cell cycle signaling pathways, whereas those in the WM_score-low subgroup showed strong infiltration of inflammatory cells and poor survival. Regarding the immunotherapy response, a high WM_score showed a significant correlation with PD-L1 expression, predicting the effect of PD-L1 blockade therapy. Conclusion: The WM_scoring system could facilitate scoring and prediction of GC prognosis.

Journal ArticleDOI
TL;DR: A prediction model based on extreme gradient boosting (XGBoost) algorithm was constructed by integrating gene expression data of different cancer cell lines, targets information of natural compounds and drug response data to determine synergistic compound combinations from complex components of TCM.
Abstract: Traditional Chinese medicine (TCM) is characterized by synergistic therapeutic effects involving multiple compounds and targets, which provide potential new therapies for the treatment of complex cancer conditions. However, the main contributors and the underlying mechanisms of synergistic TCM cancer therapies remain largely undetermined. Machine learning now provides a new approach to determine synergistic compound combinations from the complex components of TCM. In this study, a prediction model based on the extreme gradient boosting (XGBoost) algorithm was constructed by integrating gene expression data of different cancer cell lines, target information of natural compounds, and drug response data. Radix Paeoniae Rubra (RPR) was selected as a model herbal sample to evaluate the reliability of the constructed model. The optimal XGBoost prediction model achieved good performance, with a Mean Square Error (MSE) of 0.66, a Mean Absolute Error (MAE) of 0.61, and a Root Mean Squared Error (RMSE) of 0.81 on the test dataset. The superior synergistic anti-tumor combinations D15 (Paeonol + Ethyl gallate) and D13 (Paeoniflorin + Paeonol) were successfully predicted from RPR and experimentally validated on MCF-7 cells. Moreover, the combination D13 could work as a main contributor to the synergistic anti-proliferative activity in the compatibility of RPR and Cortex Moutan (CM). Our XGBoost model could be a reliable tool for the efficient prediction of synergistic anti-tumor multi-compound combinations from TCM.
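One common way to score whether a predicted combination is actually synergistic is the Bliss-independence excess: observed combined effect minus the effect expected if the two compounds acted independently. This is a standard synergy readout offered here as context; the paper's own experimental scoring may differ, and the inhibition values below are toy numbers.

```python
def bliss_synergy(effect_a, effect_b, effect_combo):
    """Bliss-independence excess for fractional effects in [0, 1].
    Expected independent effect: Ea + Eb - Ea*Eb.
    Positive return values indicate synergy, negative antagonism."""
    expected = effect_a + effect_b - effect_a * effect_b
    return effect_combo - expected

# Toy fractional inhibitions for two compounds and their combination
print(round(bliss_synergy(0.30, 0.40, 0.75), 2))  # 0.17 -> synergistic
```

A model like the XGBoost predictor above proposes candidate pairs; a score such as this, computed from cell-line assays, is what "experimentally validated" ultimately quantifies.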

Journal ArticleDOI
TL;DR: Wei et al. as mentioned in this paper proposed a refiner towards curated-like multiple sequence alignments of the Pfam protein families, J Bioinform Comput Biol, 2022, https://doi.org/10.1142/S0219720022400029.
Abstract: Journal of Bioinformatics and Computational Biology, Vol. 20, No. 04, 2202001 (2022). Special Issue: Selected Papers from InCoB 2021. Introduction to Selected Papers from InCoB 2021, by Guest Editor Yun Zheng, State Key Laboratory of Primate Biomedical Research, Institute of Primate Translational Medicine, Kunming University of Science and Technology, Kunming, Yunnan 650500, P. R. China. https://doi.org/10.1142/S0219720022020012. Published: 3 August 2022. References: 1. Wei Q, Zou H, Zhong C, Xu J, RPfam: A refiner towards curated-like multiple sequence alignments of the Pfam protein families, J Bioinform Comput Biol, 2022, https://doi.org/10.1142/S0219720022400029. 2. Charles S, Sreekumar J, Natarajan J, Transcriptomic meta-analysis reveals biomarker pairs and key pathways in Tetralogy of Fallot, J Bioinform Comput Biol, 2022, https://doi.org/10.1142/S0219720022400042.

Journal ArticleDOI
TL;DR: In this article , a machine learning-based phylogenetic tree generation model based on agglomerative clustering (PTGAC) was proposed for protein sequences considering all known chemical properties of amino acids.
Abstract: This work proposes a machine learning-based phylogenetic tree generation model based on agglomerative clustering (PTGAC) that compares protein sequences while considering all known chemical properties of amino acids. The proposed model can serve as a suitable alternative to the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), which is inherently time-consuming. Initially, principal component analysis (PCA) is used in the proposed scheme to reduce the dimensions of the 20 amino acids using seven known chemical characteristics, yielding 20 TP (Total Points) values, one for each amino acid. A cumulative-summing approach is then used to give a non-degenerate numeric representation of the sequences based on these 20 TP values. A special kind of three-component vector is proposed as a descriptor, which consists of a new type of non-central moment of orders one, two, and three. Subsequently, the proposed model uses Euclidean distances among the descriptors to create a distance matrix. Finally, a phylogenetic tree is constructed using hierarchical agglomerative clustering based on the distance matrix. The results are compared with those of UPGMA and other existing methods in terms of the quality of the phylogenetic tree and the time needed to construct it. Both qualitative and quantitative analyses are performed as key assessment criteria for analyzing the performance of the proposed model. The qualitative analysis of the phylogenetic tree considers rationalized perception, while the quantitative analysis is based on symmetric distance (SD). On both criteria, the results obtained by the proposed model are more satisfactory than those produced earlier on the same species by other methods. Notably, the method is efficient in terms of both time and space requirements and is capable of dealing with protein sequences of varying lengths.
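The encode-then-compare pipeline can be sketched end to end in numpy: cumulative-sum encoding of per-residue values, a three-component moment descriptor, and a Euclidean distance matrix ready for agglomerative clustering. The TP values below are fabricated stand-ins for the PCA-derived totals, and central moments stand in for the paper's non-central moment variant.

```python
import numpy as np

# Toy TP values per amino acid (stand-ins for the PCA-derived totals)
TP = {"A": 0.2, "C": -0.5, "G": 0.1, "T": 0.7, "L": 1.1, "K": -0.9}

def cumsum_encoding(seq):
    """Non-degenerate numeric representation: cumulative sum of TP values."""
    return np.cumsum([TP[aa] for aa in seq])

def moment_descriptor(curve):
    """Three-component descriptor from moments of orders 1-3 of the curve
    (central moments here; the paper uses a non-central variant)."""
    m = curve.mean()
    return np.array([m, np.mean((curve - m) ** 2), np.mean((curve - m) ** 3)])

seqs = ["ACGT", "ACGA", "KLLK"]
desc = np.array([moment_descriptor(cumsum_encoding(s)) for s in seqs])
dist = np.linalg.norm(desc[:, None, :] - desc[None, :, :], axis=-1)
print(dist[0, 1] < dist[0, 2])  # True: similar sequences lie closer
```

Feeding `dist` to any hierarchical agglomerative clustering routine then yields the tree, which is the step that replaces UPGMA's more expensive pairwise-alignment distances.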

Journal ArticleDOI
TL;DR: In this article , computational methods were used in order to verify whether peptide drug inhibitors are good drug candidates against the ubiquitin protein, UBE2C by conducting docking, MD and MMPBSA analyses.
Abstract: The World Health Organization (WHO) has declared breast cancer (BC) the most prevalent cancer in the world. Given its prevalence and severity, there have been several breakthroughs in developing treatments for the disease. Targeted therapy treatments limit the damage done to healthy tissues. These targeted therapies are especially potent for luminal and HER2-positive breast cancer. However, for triple-negative breast cancer (TNBC), the lack of defining biomarkers makes it hard to approach with targeted therapy methods. Protein-protein interactions (PPIs) have been studied as possible targets for drug action. However, small-molecule drugs are not able to cover the entirety of the PPI binding interface. Peptides were found to be better suited to large or flat PPI surfaces, in addition to their better pharmacokinetic properties. In this study, computational methods were used to verify whether peptide drug inhibitors are good drug candidates against the ubiquitin-conjugating enzyme UBE2C by conducting docking, molecular dynamics (MD), and MM-PBSA analyses. Results show that while the lead peptide T20-M shows good potential as a peptide drug, its binding affinity toward UBE2C is not enough to overcome the natural UBE2C-ANAPC2 interaction. Further studies on the modification of T20-M and the analysis of other peptide leads are recommended.

Journal ArticleDOI
TL;DR: A predictor called iRNA5hmC-HOC, based on a high-order correlation information method, is developed to identify RNA 5hmC modification sites; the results indicate that the proposed method might be a promising tool for identifying RNA 5-hydroxymethylcytosine modification sites.
Abstract: RNA 5-hydroxymethylcytosine (5hmC) is an important RNA modification that plays a vital role in several biological processes. Identifying 5hmC sites is currently a topic of great interest because it benefits the understanding of their biological functions. Therefore, in this study, we developed a predictor called iRNA5hmC-HOC, which is based on a high-order correlation information method, to identify 5hmC sites. To build the model, 22 different classes of dinucleotide physicochemical (PC) properties were employed to represent RNA sequences, and the least absolute shrinkage and selection operator (LASSO) algorithm was adopted to select the most discriminative features. In the jackknife test, the proposed method achieved 89.80% classification accuracy based on a support vector machine (SVM). Compared with state-of-the-art predictors, our proposed method shows a significant improvement in classification performance. This indicates that the proposed method might be a promising tool for identifying RNA 5hmC modification sites. The dataset and source codes are available at https://figshare.com/articles/online_resource/iRNA5hmC-HOC/15177450.
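The LASSO feature-selection step can be sketched with iterative soft-thresholding (ISTA), one compact way to obtain the sparse weights that pick out discriminative features before a downstream SVM. The data, penalty, and iteration count below are illustrative; the paper's solver and settings are its own.

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding operator for the L1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, y, lam=0.1, iters=500):
    """LASSO via ISTA: gradient step on the squared loss followed by
    soft-thresholding, which drives uninformative weights to zero."""
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
true_w = np.zeros(10)
true_w[[1, 4]] = [1.5, -2.0]             # only 2 of 10 features are informative
y = X @ true_w + 0.01 * rng.normal(size=50)
w = lasso_ista(X, y)
print(np.flatnonzero(np.abs(w) > 0.1))   # -> [1 4]
```

The indices surviving the threshold are the selected features; in the predictor above, the analogous surviving PC-property features are what the SVM is trained on.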

Journal ArticleDOI
TL;DR: In this paper , an original software-implemented numerical methodology used to determine the effect of mutations on binding to small chemical molecules, on the example of gefitinib, AMPPNP, CO-1686, ASP8273, erlotinib binding with EGFR protein, and imatinib Binding with PPARgamma.
Abstract: In this paper, the authors present and describe in detail an original software-implemented numerical methodology used to determine the effect of mutations on the binding of small chemical molecules, using the examples of gefitinib, AMPPNP, CO-1686, ASP8273, and erlotinib binding with the EGFR protein, and imatinib binding with PPARgamma. Furthermore, the developed numerical approach makes it possible to determine the stability of a molecular complex consisting of a protein and a small chemical molecule. The description of the software package that implements the presented algorithm is given on the website: https://binomlabs.com/ .