
Showing papers in "Journal of Bioinformatics and Computational Biology in 2021"


Journal ArticleDOI
TL;DR: Wang et al. combined discrete cosine transform (DCT) feature compression with a support vector machine (SVM) to discriminate single-stranded from double-stranded DNA-binding proteins (dsDBPs).
Abstract: DNA-binding proteins (DBPs) play an influential role in diverse biological activities such as DNA replication, splicing, repair, and transcription. Some DBPs are indispensable for understanding many types of human cancers (e.g. lung, breast, and liver cancer) and chronic diseases (e.g. AIDS/HIV, asthma), while others are involved in the design of antibiotics, steroids, and anti-inflammatory drugs. These crucial processes are closely related to DBP types. DBPs are categorized into single-stranded DNA-binding proteins (ssDBPs) and double-stranded DNA-binding proteins (dsDBPs). Few computational predictors have been reported for discriminating ssDBPs and dsDBPs, and due to the limitations of the existing methods, an intelligent computational system is still highly desirable. In this work, features are discovered from protein sequences by extending the notions of dipeptide composition (DPC), the evolutionary difference formula (EDF), and K-separated bigrams (KSB) to the position-specific scoring matrix (PSSM). The highly intrinsic information was encoded by a compression approach, the discrete cosine transform (DCT), and the model was trained with a support vector machine (SVM). The prediction performance was further boosted by a genetic algorithm (GA) ensemble strategy. The novel predictor (DBP-GAPred) achieved 1.89%, 0.28%, and 6.63% higher accuracies on jackknife, 10-fold, and independent dataset tests, respectively, than the best existing predictor. These outcomes confirm the superiority of our method over existing predictors.
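The DCT compression step described above can be illustrated in a few lines. The sketch below uses random stand-in features rather than real PSSM-derived descriptors; it applies the type-II DCT and keeps the low-order coefficients, which carry most of the signal energy, producing the compact matrix an SVM would then be trained on.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
pssm_features = rng.normal(size=(100, 400))   # stand-in for flattened PSSM-derived features
labels = rng.integers(0, 2, size=100)         # ssDBP (0) vs dsDBP (1), random here

# DCT-II concentrates most of the energy in the low-order coefficients,
# so truncating to the first k coefficients acts as lossy compression.
k = 50
compressed = dct(pssm_features, type=2, norm='ortho', axis=1)[:, :k]
print(compressed.shape)  # (100, 50)
```

An SVM (or any classifier) can then be fitted on `compressed` in place of the full 400-dimensional features.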

18 citations


Journal ArticleDOI
TL;DR: In this article, an improved 3D version of the U-Net model trained with the Dice loss function is used to predict binding sites accurately; on independent test datasets and SARS-CoV-2, the segmentation model predicts binding sites with a more accurate shape than the recently published deep learning model DeepSite.
Abstract: Binding site prediction for new proteins is important in structure-based drug design. The identified binding sites may be helpful in the development of treatments for new viral outbreaks when there is no information available about their pockets, with COVID-19 being a case in point. Identification of the pockets using computational methods, as an alternative, has recently attracted much interest. In this study, binding site prediction is viewed as a semantic segmentation problem. An improved 3D version of the U-Net model based on the Dice loss function is utilized to predict the binding sites accurately. The performance of the proposed model on the independent test datasets and SARS-CoV-2 shows the segmentation model can predict binding sites with a more accurate shape than the recently published deep learning model DeepSite. Therefore, the model may help predict the binding sites of proteins and could be used in drug design for novel proteins.
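The Dice loss mentioned above is simple enough to sketch directly. The NumPy version below is a minimal illustration on a toy 3D mask, not the paper's 3D U-Net implementation.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """1 - Dice coefficient over a (possibly 3D) probability map."""
    pred, target = pred.ravel(), target.ravel()
    intersection = (pred * target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    return 1.0 - dice

mask = np.zeros((8, 8, 8))
mask[2:5, 2:5, 2:5] = 1.0
print(dice_loss(mask, mask))        # ~0.0 for a perfect prediction
print(dice_loss(mask, 1.0 - mask))  # ~1.0 for a fully wrong prediction
```

Because the Dice coefficient measures overlap rather than per-voxel accuracy, this loss is less sensitive to the extreme class imbalance typical of binding-site segmentation.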

7 citations


Journal ArticleDOI
TL;DR: Many regions of the protein universe remain inaccessible to wet-laboratory or computational structure determination methods, and a significant challenge lies in elucidating these dark regions in silico.
Abstract: Many regions of the protein universe remain inaccessible to wet-laboratory or computational structure determination methods. A significant challenge in elucidating these dark regions in silico rela...

6 citations


Journal ArticleDOI
TL;DR: In this paper, a novel neural network-based approach called HemoNet was developed for predicting the hemolytic activity of peptides, since existing methods are unable to accurately model important aspects of this predictive problem, such as the role of N/C-terminal modifications and D- and L-amino acids.
Abstract: Quantifying the hemolytic activity of peptides is a crucial step in the discovery of novel therapeutic peptides. Computational methods are attractive in this domain due to their ability to guide wet-lab experimental discovery or screening of peptides based on their hemolytic activity. However, existing methods are unable to accurately model various important aspects of this predictive problem, such as the role of N/C-terminal modifications, D- and L-amino acids, etc. In this work, we have developed a novel neural network-based approach called HemoNet for predicting the hemolytic activity of peptides. The proposed method captures the contextual importance of different amino acids in a given peptide sequence using a specialized feature embedding in conjunction with a SMILES-based fingerprint representation of N/C-terminal modifications. We have analyzed the predictive performance of the proposed method using stratified cross-validation in comparison with previous methods, non-redundant cross-validation, as well as validation on external peptides and clinical antimicrobial peptides. Our analysis shows the proposed approach achieves significantly better predictive performance (AUC-ROC of 88%) in comparison to previous approaches (HemoPI and HemoPred with AUC-ROC of 73%). HemoNet can be a useful tool in the search for novel therapeutic peptides. The Python implementation of the proposed method is available at the URL: https://github.com/adibayaseen/HemoNet.

6 citations


Journal ArticleDOI
TL;DR: A central problem of systems biology is the reconstruction of Gene Regulatory Networks (GRNs) by the use of time series data as discussed by the authors, and many attempts have been made to design an efficient method for this task.
Abstract: A central problem of systems biology is the reconstruction of Gene Regulatory Networks (GRNs) by the use of time series data. Although many attempts have been made to design an efficient method for...

5 citations


Journal ArticleDOI
TL;DR: A novel CNN model called Drug Convolutional Neural Network (DCNN) is proposed to predict ADRs from the chemical structures of drugs; in a case study, the model predicted ADRs for COVID-19 recommended drugs that align well with observations made by medical professionals using conventional methods.
Abstract: Prediction of Adverse Drug Reactions (ADRs) has been an important aspect of pharmacovigilance because of its impact on the pharma industry. The standard process of introducing a new drug to market involves many clinical trials and tests; this is a tedious and time-consuming process and also requires considerable monetary resources. Faster approval of a drug helps the patients who need it, and in silico prediction of ADRs can help speed up the process. The challenges involved are the lack of negative data and the difficulty of predicting ADRs from the chemical structure alone. Although many models are already available to predict ADRs, most use biological activity identifiers and chemical and physical properties in addition to the chemical structures of the drugs. But for most new drugs to be tested, only chemical structures will be available, and the performance of existing models that predict ADRs from chemical structures alone is not efficient. Therefore, an efficient method for predicting ADRs from just the chemical structure is proposed in this paper. The proposed approach involves a separate model for each ADR, making it a binary classification problem. This paper presents a novel CNN model called Drug Convolutional Neural Network (DCNN) to predict ADRs using the chemical structures of drugs. The performance is measured using metrics such as Accuracy, Recall, Precision, Specificity, F1 score, AUROC, and MCC. The results obtained by the proposed DCNN model outperform the competing models on the SIDER4.1 database in terms of all the metrics. A case study has been performed on COVID-19 recommended drugs, where the proposed model predicted ADRs that are well aligned with the observations made by medical professionals using conventional methods.
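The one-model-per-ADR design above reduces to training independent binary classifiers over a multi-label matrix. The sketch below is a hedged illustration of that setup with random placeholder fingerprints and logistic regression as a stand-in classifier (the paper itself uses a CNN on chemical structures and SIDER data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 64)).astype(float)  # placeholder structural fingerprints
Y = rng.integers(0, 2, size=(200, 3))                 # 3 example ADR labels (random)

models = []
for j in range(Y.shape[1]):                # one independent binary problem per ADR
    clf = LogisticRegression(max_iter=1000).fit(X, Y[:, j])
    models.append(clf)

# One predicted probability per drug per ADR.
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
print(probs.shape)  # (200, 3)
```

Swapping the stand-in classifier for a CNN over structure-derived inputs recovers the paper's actual setup without changing this outer loop.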

5 citations


Journal ArticleDOI
TL;DR: It is shown that clustering FASTA data sets of short reads into similar sub-groups for a group-by-group compression can greatly improve the compression performance.
Abstract: FASTA data sets of short reads are usually generated in tens or hundreds for a biomedical study. However, current compression of these data sets is carried out one-by-one without consideration of t...

5 citations


Journal ArticleDOI
TL;DR: A front-end method based on random forest proximity distance is used to screen the test set to improve protein-protein interaction site (PPIS) prediction and the distance information provided by PD can be used to indicate the reliability of prediction results.
Abstract: A front-end method based on random forest proximity distance (PD) is used to screen the test set to improve protein-protein interaction site (PPIS) prediction. The assessment of a distance metric is done under the assumption that a distance definition of higher quality leads to higher classification accuracy. On an independent test set, numerical analysis based on statistical inference shows that the PD has an advantage over the Mahalanobis and Cosine distances. Based on the fact that the proximity distance depends on the tree composition of the random forest model, an iterative method is designed to optimize the proximity distance, adjusting the tree composition of the random forest model by adjusting the size of the training set. Two PD metrics, 75PD and 50PD, are obtained by the iterative method. On two independent test sets, compared with the PD produced by the original training set, 75PD achieved higher Matthews correlation coefficient and F1 score values, and the differences were statistically significant. All numerical experiments show that the closer the test data are to the training data, the better the predictor's results. These findings indicate that the iterative method can optimize the proximity distance definition and that the distance information provided by PD can be used to indicate the reliability of prediction results.
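Random-forest proximity, the metric the screening above is built on, is the fraction of trees in which two samples fall in the same leaf; the corresponding distance is one minus the proximity. A small sketch on synthetic data (not PPIS features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
leaves = rf.apply(X)                         # (n_samples, n_trees) leaf indices
# Proximity = fraction of trees in which samples i and j share a leaf.
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
distance = 1.0 - proximity
print(distance[0, 0])  # 0.0: a sample shares every leaf with itself
```

Test samples whose distance to the training set exceeds a threshold would then be screened out (or their predictions flagged as less reliable), which is the role of the front-end described above.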

5 citations


Journal ArticleDOI
TL;DR: In this article, the occurrence structures of nitrogenous bases and amino acids are studied; a structural metric that measures structural differences between bases or amino acids is devised, and average distance matrices are computed for each.
Abstract: The COVID-19 pandemic has caused a global health crisis. Developing vaccines requires good knowledge of the genetic properties of SARS-CoV-2. The most fundamental approach is to look into the structures of its RNA, in particular the nucleotides and amino acids, which motivates our research on this topic. We study the occurrence structures of nitrogenous bases and amino acids. To this aim, we devise a structural metric that measures the structural differences between bases or amino acids. By analyzing various SARS-CoV-2 samples, we calculate the distance matrices for nitrogenous bases and amino acids and, from these, the average distance matrices for each. We then identify the relations of all the minimal distances between bases and amino acids. The results also show that different substructures yield much more diversified distances between amino acids. Finally, we compare our structural metric with other frequently used metrics, in particular the Hausdorff metric.
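Since the abstract benchmarks its structural metric against the Hausdorff metric, a compact reference implementation may be useful. The NumPy sketch below computes the symmetric Hausdorff distance between two toy point sets; it is illustrative only, not the paper's metric.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point sets A and B (rows = points)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    # Largest distance from any point in one set to its nearest point in the other.
    return max(d.min(axis=1).max(), d.min(axis=0).max())

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
print(hausdorff(A, B))  # 1.0
```

The same two-sided max-of-min structure applies whatever the underlying representation of the sequences being compared.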

4 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a computational method for detecting N6-methyladenine (6mA) sites, since detection by biochemical experiments, although accurate, is laborious and expensive.
Abstract: Accurate detection of N6-methyladenine (6mA) sites by biochemical experiments will help to reveal their biological functions, still, these wet experiments are laborious and expensive. Therefore, it...

4 citations


Journal ArticleDOI
TL;DR: Experimental results have shown that the proposed method can be used as a reliable computational model to reveal potential relationships between miRNAs and diseases.
Abstract: MicroRNAs (miRNA) are a type of non-coding RNA molecules that are effective on the formation and the progression of many different diseases. Various researches have reported that miRNAs play a majo...

Journal ArticleDOI
TL;DR: In this article, a sequence-based machine learning framework for the prediction of pore-forming toxins was developed, using distributed representations of the protein sequence encoded by reduced alphabet schemes based on conformational similarity and the hydropathy index as input features to Support Vector Machines (SVMs).
Abstract: Bacterial virulence can be attributed to a wide variety of factors including toxins that harm the host. Pore-forming toxins are one class of toxins that confer virulence to the bacteria and are one of the promising targets for therapeutic intervention. In this work, we develop a sequence-based machine learning framework for the prediction of pore-forming toxins. For this, we have used distributed representation of the protein sequence encoded by reduced alphabet schemes based on conformational similarity and hydropathy index as input features to Support Vector Machines (SVMs). The choice of conformational similarity and hydropathy indices is based on the functional mechanism of pore-forming toxins. Our methodology achieves about 81% accuracy indicating that conformational similarity, an indicator of the flexibility of amino acids, along with hydrophobic index can capture the intrinsic features of pore-forming toxins that distinguish it from other types of transporter proteins. Increased understanding of the mechanisms of pore-forming toxins can further contribute to the use of such "mechanism-informed" features that may increase the prediction accuracy further.

Journal ArticleDOI
TL;DR: In this article, a data-driven, deep learning-based imputation and inference framework (DIIF) is proposed to quantify the personalized and race-specific causal effects.
Abstract: Prostate Specific Antigen (PSA) level in the serum is one of the most widely used markers in monitoring prostate cancer (PCa) progression, treatment response, and disease relapse. Although significant efforts have been taken to analyze various socioeconomic and cultural factors that contribute to the racial disparities in PCa, limited research has been performed to quantitatively understand how and to what extent molecular alterations may impact differential PSA levels present at varied tumor status between African-American and European-American men. Moreover, missing values among patients add another layer of difficulty in precisely inferring their outcomes. In light of these issues, we propose a data-driven, deep learning-based imputation and inference framework (DIIF). DIIF seamlessly encapsulates two modules: an imputation module driven by a regularized deep autoencoder for imputing critical missing information and an inference module in which two deep variational autoencoders are coupled with a graphical inference model to quantify the personalized and race-specific causal effects. Large-scale empirical studies on the independent sub-cohorts of The Cancer Genome Atlas (TCGA) PCa patients demonstrate the effectiveness of DIIF. We further found that somatic mutations in TP53, ATM, PTEN, FOXA1, and PIK3CA are statistically significant genomic factors that may explain the racial disparities in different PCa features characterized by PSA.

Journal ArticleDOI
TL;DR: The Cox proportional hazards model has been widely used in cancer genomic research that aims to identify genes from high-dimensional gene expression space associated with the survival time of patie....
Abstract: The Cox proportional hazards model has been widely used in cancer genomic research that aims to identify genes from high-dimensional gene expression space associated with the survival time of patie...

Journal ArticleDOI
TL;DR: Using inverse engineering and a multi-objective optimization procedure that fits more than one experimental growth curve simultaneously, this study determined the best parameter values of a model, permitting construction of a Pareto front with 50 individuals or phenotypes.
Abstract: Several mathematical models have been developed to understand the interactions of microorganisms in foods and predict their growth. The resulting model equations for the growth of interacting cells include several parameters that must be determined for the specific conditions to be modeled. In this study, these parameters were determined using inverse engineering and a multi-objective optimization procedure that allows fitting more than one experimental growth curve simultaneously. A genetic algorithm was applied to obtain the best parameter values of a model, permitting construction of the Pareto front with 50 individuals or phenotypes. The method was applied to three experimental data sets of simultaneous growth of lactic acid bacteria (LAB) and Listeria monocytogenes (LM). The proposed method was then compared with a conventional mono-objective sequential fit. We conclude that the multi-objective fit by the genetic algorithm gives superior results, with more parameter identifiability, than the conventional sequential approach.
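The core operation behind maintaining a Pareto front of phenotypes is non-dominated filtering. The sketch below, with made-up objective values standing in for fit errors on two growth curves, keeps exactly the rows no other row dominates (both objectives minimized).

```python
import numpy as np

def pareto_front(costs):
    """Boolean mask of non-dominated rows; all objectives are minimized."""
    n = costs.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Row j dominates row i if it is no worse everywhere and strictly better somewhere.
        dominated = (costs <= costs[i]).all(axis=1) & (costs < costs[i]).any(axis=1)
        if dominated.any():
            mask[i] = False
    return mask

costs = np.array([[1.0, 4.0], [2.0, 2.0], [4.0, 1.0], [3.0, 3.0]])
print(pareto_front(costs))  # [ True  True  True False]
```

In the GA described above, this filter would be applied each generation so that the surviving population approximates the trade-off surface between the two curve-fitting errors.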

Journal ArticleDOI
TL;DR: In this article, an alternating projected gradient approach, gradient KCCA (gradKCCA), is adopted to solve kernel canonical correlation with an additional constraint that projection directions have pre-images in the original data space; a sparsity-inducing variant of the model is achieved by controlling the [Formula: see text]-norm of the pre-images of the projection directions.
Abstract: Neuroimaging genetics has become an important research topic since it can reveal complex associations between genetic variants (i.e. single nucleotide polymorphisms (SNPs)) and the structures or functions of the human brain. However, existing kernel mappings make it difficult to use sparse representation methods directly in the kernel feature space, so most existing sparse canonical correlation analysis (SCCA) methods cannot be directly extended to it. To bridge this gap, we adopt a novel alternating projected gradient approach, the gradient KCCA (gradKCCA) model, to develop a powerful model for exploring the intrinsic associations between genetic markers and imaging quantitative traits (QTs) of interest. Specifically, this model solves kernel canonical correlation analysis (KCCA) with an additional constraint that projection directions have pre-images in the original data space; a sparsity-inducing variant of the model is achieved by controlling the [Formula: see text]-norm of the pre-images of the projection directions. We evaluate this model using the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort to discover relationships between SNPs from the Alzheimer's disease (AD) risk gene APOE and imaging QTs extracted from structural magnetic resonance imaging (MRI) scans. Our results show that the algorithm not only outperforms the traditional KCCA method in terms of Root Mean Square Error (RMSE) and Correlation Coefficient (CC) but also identifies meaningful and relevant SNP biomarkers (e.g. rs157594 and rs405697), which are positively related to the right Postcentral and right SupraMarginal brain regions in this study. Empirical results indicate its promising capability in revealing biologically meaningful neuroimaging genetics associations and improving the disease-related mechanistic understanding of AD.
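gradKCCA itself is beyond a short snippet, but the quantity it optimizes reduces, in the linear case, to canonical correlation. As a simplified linear analogue (not the paper's kernel method), the sketch below recovers the first canonical correlation via whitening and an SVD, on synthetic data with one shared latent signal.

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-6):
    """First canonical correlation between column-centered blocks X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    # Whiten both blocks; the top singular value of the whitened
    # cross-covariance is the first canonical correlation.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    return np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)[0]

rng = np.random.default_rng(3)
z = rng.normal(size=(500, 1))                    # shared latent signal
X = np.hstack([z, rng.normal(size=(500, 2))])    # synthetic SNP-like block
Y = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
print(cca_first_pair(X, Y))  # close to 1: the shared latent signal is recovered
```

The kernel variant replaces the linear cross-covariance with kernel matrices; gradKCCA's pre-image constraint and [Formula: see text]-norm control are what make the resulting directions sparse and interpretable.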

Journal ArticleDOI
TL;DR: In this paper, a novel variant of the artificial bee colony optimization algorithm is proposed to improve the exploitation process when digging out common structural patterns of DNA-binding proteins, and the motif locations obtained using the derived common pattern are compared with the results of two other motif detection tools.
Abstract: In this work, we have developed an optimization framework for digging out common structural patterns inherent in DNA-binding proteins. A novel variant of the artificial bee colony optimization algorithm is proposed to improve the exploitation process. Experiments on four benchmark objective functions at different dimensions demonstrated the speedier convergence of the algorithm. It also generated optimum features of the Helix-Turn-Helix structural pattern based on an objective function defined by occurrence counts on secondary structure. The proposed algorithm outperformed the compared methods in convergence speed and the quality of generated motif features. The motif locations obtained using the derived common pattern were compared with the results of two other motif detection tools; 92% of tested proteins produced locations matching the results of the compared methods. The performance of the approach was analyzed with various measures, showing higher sensitivity, specificity, and area-under-the-curve values. A novel strategy for druggability discovery through docking studies targeting the motif locations is also discussed.

Journal ArticleDOI
TL;DR: In this article, the performance of LeDock and three standalone scoring functions was tested on 195 high-quality protein-ligand complexes; the freely available docking program LeDock achieved a success rate of 89.20% for the best pose, indicative of strong sampling power.
Abstract: Molecular docking is a fast and efficient computational method for predicting the binding mode and binding affinity between a ligand and a target protein at the atomic level. However, the performance of current docking programs is less than satisfactory. Herein, with a focus on free programs and scoring functions, the performances of LeDock and three standalone scoring functions were tested on 195 high-quality protein-ligand complexes. Results showed that the freely available docking program LeDock achieved a success rate of 89.20% for the best pose, indicative of strong sampling power. Based on the poses generated by LeDock, a comparative evaluation of three other non-commercial scoring functions, DSX (DrugScore X), PoseScore, and X-score, was performed. Among the evaluated scoring functions, DSX and X-score exhibited the best scoring power and ranking power, respectively. The performances of LeDock, DSX, and X-score were similar in the docking power test and much better than PoseScore. Accordingly, it is suggested that combining pose sampling by LeDock with rescoring by DSX or X-score could improve the prediction accuracy of molecular docking and be applied in lead discovery.

Journal ArticleDOI
TL;DR: A model called Sparse Robust Graph-regularized Non-negative Matrix Factorization based on Correntropy (SGNMFC) is proposed, in which maximized correntropy replaces the traditional minimized Euclidean distance to improve the robustness of the algorithm.
Abstract: Non-negative Matrix Factorization (NMF) has become a popular data dimension reduction method in recent years, but the traditional NMF method is highly sensitive to data noise. In this paper, we propose a model called Sparse Robust Graph-regularized Non-negative Matrix Factorization based on Correntropy (SGNMFC). Maximized correntropy replaces the traditional minimized Euclidean distance to improve the robustness of the algorithm: through its kernel function, correntropy gives less weight to outliers and noise in the data but greater weight to meaningful data. Meanwhile, the geometric structure of the high-dimensional data is preserved in the low-dimensional manifold through graph regularization. Feature selection and sample clustering are commonly used methods for analyzing genes, and sparse constraints are applied to the loss function to reduce matrix complexity and analysis difficulty. Compared with five other similar methods, the effectiveness of the SGNMFC model is demonstrated by differentially expressed gene selection and sample clustering experiments on three The Cancer Genome Atlas (TCGA) datasets.
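The robustness mechanism above comes from correntropy's Gaussian kernel of the residual, which induces per-entry weights that vanish for outliers. A minimal illustration of those induced weights follows; the full SGNMFC factorization updates are not reproduced here.

```python
import numpy as np

def correntropy_weights(residuals, sigma=1.0):
    """Gaussian-kernel weights: near 1 for small residuals, near 0 for outliers."""
    return np.exp(-residuals**2 / (2.0 * sigma**2))

residuals = np.array([0.0, 0.5, 1.0, 5.0])   # last entry mimics a gross outlier
w = correntropy_weights(residuals)
print(np.round(w, 3))  # the outlier's weight is effectively zero
```

Under a squared-Euclidean loss all four residuals would contribute in proportion to their square, so the outlier would dominate; the correntropy weighting is what suppresses it.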

Journal ArticleDOI
TL;DR: In this article, it is observed that splice junctions lacking the consensus sequences are not canonical splice junctions and are therefore missed by current prediction models built around canonical junctions.
Abstract: Most of the current computational models for splice junction prediction are based on the identification of canonical splice junctions. However, it is observed that the junctions lacking the consens...

Journal ArticleDOI
TL;DR: A new approach (called FastMtQTL) for multi-trait QTL analysis is introduced, based on the assumption of a multivariate normal distribution of phenotypic observations, that identifies almost the same QTL positions as those identified by existing methods.
Abstract: Multivariate simple interval mapping (SIM) is one of the most popular approaches for multiple quantitative trait locus (QTL) analysis. Both maximum likelihood (ML) and least squares (LS) multivariate regression (MVR) are widely used methods for multi-trait SIM. ML-based MVR (MVR-ML) is an expectation maximization (EM) algorithm-based, iterative, and complex time-consuming approach. Although the LS-based MVR (MVR-LS) approach is not iterative, the calculation of the likelihood ratio (LR) statistic in MVR-LS is also a time-consuming, complex process. We introduce a new approach (called FastMtQTL) for multi-trait QTL analysis based on the assumption of a multivariate normal distribution of phenotypic observations. Our proposed method identifies almost the same QTL positions as those identified by the existing methods. Moreover, it requires comparatively less computation time because of the simplicity of its LR statistic, which is calculated using only the sample variance-covariance matrix of phenotypes and the conditional probability of QTL genotypes given the marker genotypes. This improvement in computation time is advantageous when the numbers of phenotypes and individuals are large and the markers are very dense, resulting in QTL mapping on bigger datasets.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a methodology to detect deregulation mechanisms with a particular focus on cancer subtypes based on the comparison between tumoral and healthy cells, and then they measured the ability of each transcription factor to explain these deregulations.
Abstract: In many cancers, mechanisms of gene regulation can be severely altered. Identification of deregulated genes, which do not follow the regulation processes that exist between transcription factors and their target genes, is of importance to better understand the development of the disease. We propose a methodology to detect deregulation mechanisms with a particular focus on cancer subtypes. This strategy is based on the comparison between tumoral and healthy cells. First, we use gene expression data from healthy cells to infer a reference gene regulatory network. Then, we compare it with gene expression levels in tumor samples to detect deregulated target genes. We finally measure the ability of each transcription factor to explain these deregulations. We apply our method on a public bladder cancer data set derived from The Cancer Genome Atlas project and confirm that it captures hallmarks of cancer subtypes. We also show that it enables the discovery of new potential biomarkers.
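The detection step described above reduces to: fit a regulatory model on healthy samples, then flag tumor samples whose target expression deviates from its prediction. Below is a hedged linear-regression sketch with synthetic expression values, a deliberate simplification of the paper's network-based approach (one transcription factor, one target).

```python
import numpy as np

rng = np.random.default_rng(4)
tf_healthy = rng.normal(size=(100, 1))          # transcription factor expression
target_healthy = 2.0 * tf_healthy[:, 0] + 0.1 * rng.normal(size=100)

# Reference regulatory model inferred from healthy cells (least squares).
slope = np.linalg.lstsq(tf_healthy, target_healthy, rcond=None)[0][0]

tf_tumor = rng.normal(size=(20, 1))
target_tumor = 2.0 * tf_tumor[:, 0] + 0.1 * rng.normal(size=20)
target_tumor[:5] += 4.0                          # simulate five deregulated samples

# A tumor sample is flagged when its target departs strongly from the
# healthy-cell prediction.
residual = np.abs(target_tumor - slope * tf_tumor[:, 0])
deregulated = residual > 3 * residual[5:].std()
print(deregulated[:5])  # the five shifted samples are flagged
```

The real method infers a full regulatory network from healthy expression and scores each transcription factor by how well it explains the observed deregulations, but the residual-against-reference logic is the same.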

Journal ArticleDOI
TL;DR: The authors explore the sequence-based prediction of change in protein binding affinity upon mutation and question the effectiveness of cross-validation (CV) across mutations, as adopted in previous studies, for assessing the generalization ability of such predictors to mutations unseen during training.
Abstract: Accurately determining the change in protein binding affinity upon mutation is important for finding novel therapeutics and assisting mutagenesis studies. Determining the change in binding affinity upon mutation requires sophisticated, expensive, and time-consuming wet-lab experiments that can be supported with computational methods. Most of the available computational prediction techniques depend upon protein structures, which limits their applicability to protein complexes with known 3D structures. In this work, we explore the sequence-based prediction of change in protein binding affinity upon mutation and question the effectiveness of [Formula: see text]-fold cross-validation (CV) across mutations, adopted in previous studies, for assessing the generalization ability of such predictors to mutations unseen during training. We have used protein sequence information instead of protein structures, along with machine learning techniques, to accurately predict the change in protein binding affinity upon mutation. Our proposed sequence-based predictor of change in protein binding affinity, called PANDA, performs comparably to existing methods when gauged through an appropriate CV scheme and an external independent test dataset. On the external test dataset, our method gives a maximum Pearson correlation coefficient of 0.52, compared with 0.59 for the state-of-the-art protein structure-based method MutaBind. Our protein sequence-based method for predicting the change in binding affinity upon mutation has wide applicability and comparable performance to existing protein structure-based methods. We made PANDA easily accessible through a cloud-based webserver and Python code, available at https://sites.google.com/view/wajidarshad/software and https://github.com/wajidarshad/panda, respectively.
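The CV concern raised above is that random fold splitting can place records of the same mutation (or complex) in both training and test folds, inflating scores. A group-aware split avoids this; the sketch below uses scikit-learn's GroupKFold with toy group labels standing in for mutation identifiers.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1).astype(float)    # placeholder feature rows
groups = np.repeat(["M1", "M2", "M3", "M4"], 3)   # one id per mutation/complex

folds = list(GroupKFold(n_splits=4).split(X, groups=groups))
for train_idx, test_idx in folds:
    # No group appears on both sides of any split.
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
print(len(folds))  # 4
```

Scoring a predictor under this scheme, rather than plain k-fold, is what reveals how it generalizes to mutations never seen during training.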

Journal ArticleDOI
TL;DR: This article addresses the cost of high-throughput chromosome conformation capture (Hi-C), one of the most popular methods for studying the three-dimensional organization of genomes, whose protocols can be expensive.
Abstract: High-throughput chromosome conformation capture (Hi-C) is one of the most popular methods for studying the three-dimensional organization of genomes. However, Hi-C protocols can be expensive since ...

Journal ArticleDOI
TL;DR: The de Bruijn Graph algorithm (DBG), one of the cornerstone algorithms in short-read assembly, has been extended with the rapid advancement of Next Generation Sequencing (NGS) technologies and their low cost.
Abstract: The de Bruijn Graph algorithm (DBG), one of the cornerstone algorithms in short-read assembly, has been extended with the rapid advancement of Next Generation Sequencing (NGS) technologies and low...

Journal ArticleDOI
TL;DR: In this paper, the authors propose using a characteristic period in place of the characteristic frequency traditionally used by RRM-based methods; the resulting method can be readily used for any protein sequence provided its interface residues and protein family are known.
Abstract: Specific functions in biological processes depend on protein-protein interactions. Hot spot residues play a key role in determining these interactions and have wide applications in protein engineering and drug discovery. Experimental techniques to identify hot spots are often labor-intensive and expensive, and most of the computational methods developed so far are structure-based and require training. In this work, hot spots are identified from sequence information alone using the Resonant Recognition Model (RRM). The proposed method uses a characteristic period in place of the characteristic frequency traditionally used by RRM-based methods. The characteristic period is extracted from the consensus spectrum of protein families using the Ramanujan Fourier Transform (RFT). Position-period plots for proteins are generated using the Short-Time RFT (ST-RFT) with a Gaussian window, and hot spots are identified by thresholding the signal corresponding to the protein's characteristic period in the ST-RFT. To enhance the performance of the ST-RFT, the Gaussian window shape parameter is optimized using concentration measure as a metric. Better sensitivity has been observed for this method compared with other reported RRM-based methods. Since the method is model-independent, it does not require any training and can be readily used for any protein sequence provided its interface residues and protein family are known.

Journal ArticleDOI
TL;DR: In this paper, the authors study the limitations imposed on the transcription process by the presence of short ubiquitous pauses and crowding, and demonstrate that a functional relationship among the model parameters can be estimated using a standard statistical analysis, and this functional relationship describes the various trade-offs that must be made in order for the gene to control the elongation process and achieve a desired average transcription time.
Abstract: In this paper, we study the limitations imposed on the transcription process by the presence of short ubiquitous pauses and crowding. These effects are especially pronounced in highly transcribed genes such as ribosomal genes (rrn) in fast growing bacteria. Our model indicates that the quantity and duration of pauses reported for protein-coding genes is incompatible with the average elongation rate observed in rrn genes. When maximal elongation rate is high, pause-induced traffic jams occur, increasing promoter occlusion, thereby lowering the initiation rate. This lowers average transcription rate and increases average transcription time. Increasing maximal elongation rate in the model is insufficient to match the experimentally observed average elongation rate in rrn genes. This suggests that there may be rrn-specific modifications to RNAP, which then experience fewer pauses, or pauses of shorter duration than those in protein-coding genes. We identify model parameter triples (maximal elongation rate, mean pause duration time, number of pauses) which are compatible with experimentally observed elongation rates. Average transcription time and average transcription rate are the model outputs investigated as proxies for cell fitness. These fitness functions are optimized for different parameter choices, opening up a possibility of differential control of these aspects of the elongation process, with potential evolutionary consequences. As an example, a gene's average transcription time may be crucial to fitness when the surrounding medium is prone to abrupt changes. This paper demonstrates that a functional relationship among the model parameters can be estimated using a standard statistical analysis, and this functional relationship describes the various trade-offs that must be made in order for the gene to control the elongation process and achieve a desired average transcription time. 
It also demonstrates the robustness of the system when a range of maximal elongation rates can be balanced with transcriptional pause data in order to maintain a desired fitness.
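As a toy illustration of how pause count and duration trade off against the free elongation rate, the single-polymerase Monte Carlo below draws exponential pause durations and adds them to the pause-free transit time. It deliberately ignores the traffic jams and promoter occlusion that are central to the paper's model, and the gene length, rate, and pause parameters are illustrative, not the paper's fitted values.

```python
import random

def transcription_time(length, elong_rate, n_pauses, mean_pause, rng):
    """One transcription event: free elongation time plus pause delays.

    With a single polymerase there are no jams, so pause positions do not
    matter and only the number and duration of pauses contribute.
    """
    t = length / elong_rate                       # pause-free transit time
    t += sum(rng.expovariate(1.0 / mean_pause)    # exponential pause draws
             for _ in range(n_pauses))
    return t

rng = random.Random(0)
times = [transcription_time(5000, 80.0, 10, 1.5, rng) for _ in range(2000)]
avg = sum(times) / len(times)
# Expected mean is length/rate + n_pauses*mean_pause = 62.5 + 15 = 77.5 s.
```

Sweeping (elong_rate, mean_pause, n_pauses) over a grid of such runs reproduces, in miniature, the kind of parameter-triple compatibility analysis the paper performs.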

Journal ArticleDOI
TL;DR: A step-by-step algorithm for analyzing the affinity of protein interactions and an analysis of energy interactions between the active center of a protein and the wild-type peptide interacting with it, taking into account modifications of the latter are provided.
Abstract: This paper develops and describes a detailed method for selecting inhibitors based on modified natural peptides for the SARS-CoV BJ01 spike glycoprotein. The selection of inhibitors is carried out by increasing the affinity of the peptide to the active center of the protein. The paper also provides a step-by-step algorithm for analyzing the affinity of protein interactions and presents an analysis of energy interactions between the active center of a protein and the wild-type peptide interacting with it, taking into account modifications of the latter. A description of the software package that implements the presented algorithm is given on the website https://binomlabs.com/covid19.

Journal ArticleDOI
TL;DR: Testing this algorithm on raw sequences consisting of both partial and complete nucleotide sequences of various bacteria has yielded good results in predicting the loci of prophages in them, suggesting that a data-centric approach can yield comparable results while using a fraction of the resources.
Abstract: This paper proposes a new algorithm for prophage loci prediction in bacteria. Prophages are defined in bioinformatics as viral nucleotide sequences found intermixed with host nucleotide sequences in bacteria. The proposed algorithm uses machine learning patterns and processing methodologies to provide a highly efficient system for loci prediction, thereby reducing the time-space complexity required compared with others of its class. In the training phase, a pattern database is constructed from raw nucleotide sequences of both bacteria and viruses obtained from a training set. In the prediction phase, the aforementioned database is used along with Particle Swarm Optimization (PSO) to predict the probable loci of prophages in a test set of bacterial nucleotide sequences. Testing this method on raw sequences consisting of both partial and complete nucleotide sequences of various bacteria has yielded good results in predicting the loci of prophages. This algorithm and connected processes compare favorably in terms of predictive performance with others of its class such as PhiSpy and ProphET, while outperforming them in terms of raw processing speed, suggesting that a data-centric approach can yield comparable results while using a fraction of the resources.
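The prediction phase relies on Particle Swarm Optimization; the minimal 1-D PSO sketch below is not the authors' implementation (its inertia and attraction coefficients are generic textbook values, and the quadratic objective is a stand-in for a locus-scoring function), but it shows the velocity update that pulls each particle toward its personal best and the swarm-wide best.

```python
import random

def pso_minimize(score, lo, hi, n_particles=20, iters=60, seed=0):
    """Minimal 1-D particle swarm minimization of score over [lo, hi]."""
    rng = random.Random(seed)
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                      # each particle's best position so far
    gbest = min(pos, key=score)         # swarm-wide best position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Inertia term plus pulls toward personal and global bests.
            vel[i] = (0.7 * vel[i]
                      + 1.5 * r1 * (pbest[i] - pos[i])
                      + 1.5 * r2 * (gbest - pos[i]))
            pos[i] = min(max(pos[i] + vel[i], lo), hi)
            if score(pos[i]) < score(pbest[i]):
                pbest[i] = pos[i]
        gbest = min(pbest, key=score)
    return gbest

# Toy objective: best "locus" at position 42 on a 0..100 coordinate axis.
best = pso_minimize(lambda x: (x - 42.0) ** 2, 0.0, 100.0)
```

In the prophage setting the scalar position would be a candidate genomic coordinate and the score would come from the pattern database, with the same update rule.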

Journal ArticleDOI
TL;DR: A scheme is proposed to monitor the response of cells after being infected by viruses, and this scheme can be extended straightforwardly to extract characteristics of trajectories of complex systems.
Abstract: Viral infection is a complicated dynamic process, in which viruses intrude into cells to duplicate themselves and trigger succeeding biological processes regulated by genes. It can pose a serious threat to human health. A scheme is proposed to monitor the response of cells after being infected by viruses. Co-expression levels of genes measured at successive time points form a gene expression profile sequence, which is mapped to a temporal gene regulatory network. The fission and fusion of the communities of the networks are used to find the active parts. We investigated an experiment injecting flu viruses into a total of 17 healthy volunteers, who split into an infected group and a survival group. The survival group is much more chaotic, i.e. complicated fissions and fusions of communities occur over the whole network. For the infected group, the most active part of the regulatory network forms a single community, but it is included in one of the large communities and completely conserved in the survival group. A total of six and seven genes in the active structure take part in the Parkinson's disease and ribosome pathways, respectively. In fact, a total of 30 of the genes in the active structure (covering [Formula: see text]) participate in neurodegeneration and its related pathways. This scheme can be extended straightforwardly to extract characteristics of trajectories of complex systems.
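A toy sketch of the first step of such a scheme: build a co-expression network at one time window by thresholding pairwise Pearson correlations, then take connected components as a crude stand-in for the communities whose fission and fusion the paper tracks. The threshold and the four-gene data are illustrative, and real community detection (e.g. modularity-based) would replace the component step.

```python
from itertools import combinations

def corr(x, y):
    """Pearson correlation of two equal-length expression series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def components(genes, edges):
    """Connected components via union-find (stand-in for communities)."""
    parent = {g: g for g in genes}
    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g
    for a, b in edges:
        parent[find(a)] = find(b)
    comps = {}
    for g in genes:
        comps.setdefault(find(g), set()).add(g)
    return list(comps.values())

def coexpression_components(expr, threshold=0.8):
    """expr: gene -> expression values across time points in one window."""
    edges = [(a, b) for a, b in combinations(expr, 2)
             if abs(corr(expr[a], expr[b])) >= threshold]
    return components(list(expr), edges)

# g1, g2, g3 are mutually (anti)correlated; g4 stays on its own.
expr = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8],
        "g3": [4, 3, 2, 1], "g4": [5, 1, 7, 2]}
comps = coexpression_components(expr)
```

Repeating this per time window and comparing consecutive component sets (e.g. by Jaccard overlap) gives the fission/fusion events the scheme monitors.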