
Showing papers in "Journal of Bioinformatics and Computational Biology in 2021"


Journal ArticleDOI
TL;DR: Wang et al. combined discrete cosine transform (DCT) feature compression with a support vector machine (SVM) to discriminate single-stranded from double-stranded DNA-binding proteins (dsDBPs).
Abstract: DNA-binding proteins (DBPs) play an influential role in diverse biological activities such as DNA replication, splicing, repair, and transcription. Some DBPs are indispensable for understanding many types of human cancers (e.g. lung, breast, and liver cancer) and chronic diseases (e.g. AIDS/HIV, asthma), while others are involved in the design of antibiotics, steroids, and anti-inflammatory drugs. These crucial processes are closely related to DBP types. DBPs are categorized into single-stranded DNA-binding proteins (ssDBPs) and double-stranded DNA-binding proteins (dsDBPs). Few computational predictors have been reported for discriminating ssDBPs and dsDBPs, and due to the limitations of the existing methods, an intelligent computational system is still highly desirable. In this work, features are discovered from protein sequences by extending the notions of dipeptide composition (DPC), the evolutionary difference formula (EDF), and K-separated bigrams (KSB) to the position-specific scoring matrix (PSSM). The highly intrinsic information was encoded by a compression approach, the discrete cosine transform (DCT), and the model was trained with a support vector machine (SVM). The prediction performance was further boosted by a genetic algorithm (GA) ensemble strategy. The novel predictor (DBP-GAPred) achieved 1.89%, 0.28%, and 6.63% higher accuracies on jackknife, 10-fold, and independent dataset tests, respectively, than the best existing predictor. These outcomes confirm the superiority of our method over existing predictors.
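The DCT compression step described above can be illustrated in a few lines. The sketch below uses random stand-in features rather than real PSSM-derived descriptors; it applies the type-II DCT and keeps the low-order coefficients, which carry most of the signal energy, producing the compact matrix an SVM would then be trained on.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
pssm_features = rng.normal(size=(100, 400))   # stand-in for flattened PSSM-derived features
labels = rng.integers(0, 2, size=100)         # ssDBP (0) vs dsDBP (1), random here

# DCT-II concentrates most of the energy in the low-order coefficients,
# so truncating to the first k coefficients acts as lossy compression.
k = 50
compressed = dct(pssm_features, type=2, norm='ortho', axis=1)[:, :k]
print(compressed.shape)  # (100, 50)
```

An SVM (or any classifier) can then be fitted on `compressed` in place of the full 400-dimensional features.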

18 citations


Journal ArticleDOI
TL;DR: In this article, an improved 3D version of the U-Net model trained with the Dice loss function is used to predict binding sites accurately; on independent test datasets and SARS-CoV-2, the segmentation model predicts binding sites with a more accurate shape than the recently published deep learning model DeepSite.
Abstract: Binding site prediction for new proteins is important in structure-based drug design. The identified binding sites may be helpful in the development of treatments for new viral outbreaks when there is no information available about their pockets, with COVID-19 being a case in point. Identification of the pockets using computational methods, as an alternative, has recently attracted much interest. In this study, binding site prediction is viewed as a semantic segmentation problem. An improved 3D version of the U-Net model based on the Dice loss function is utilized to predict the binding sites accurately. The performance of the proposed model on the independent test datasets and SARS-CoV-2 shows the segmentation model can predict binding sites with a more accurate shape than the recently published deep learning model DeepSite. Therefore, the model may help predict the binding sites of proteins and could be used in drug design for novel proteins.
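The Dice loss mentioned above is simple enough to sketch directly. The NumPy version below is a minimal illustration on a toy 3D mask, not the paper's 3D U-Net implementation.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """1 - Dice coefficient over a (possibly 3D) probability map."""
    pred, target = pred.ravel(), target.ravel()
    intersection = (pred * target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    return 1.0 - dice

mask = np.zeros((8, 8, 8))
mask[2:5, 2:5, 2:5] = 1.0
print(dice_loss(mask, mask))        # ~0.0 for a perfect prediction
print(dice_loss(mask, 1.0 - mask))  # ~1.0 for a fully wrong prediction
```

Because the Dice coefficient measures overlap rather than per-voxel accuracy, this loss is less sensitive to the extreme class imbalance typical of binding-site segmentation.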

7 citations


Journal ArticleDOI
TL;DR: Many regions of the protein universe remain inaccessible to wet-laboratory or computational structure determination methods, and a significant challenge lies in elucidating these dark regions in silico.
Abstract: Many regions of the protein universe remain inaccessible to wet-laboratory or computational structure determination methods. A significant challenge in elucidating these dark regions in silico rela...

6 citations


Journal ArticleDOI
TL;DR: In this paper, a novel neural network-based approach called HemoNet was developed for predicting the hemolytic activity of peptides, since existing methods are unable to accurately model important aspects of this predictive problem, such as the role of N/C-terminal modifications and D- and L-amino acids.
Abstract: Quantifying the hemolytic activity of peptides is a crucial step in the discovery of novel therapeutic peptides. Computational methods are attractive in this domain due to their ability to guide wet-lab experimental discovery or screening of peptides based on their hemolytic activity. However, existing methods are unable to accurately model various important aspects of this predictive problem, such as the role of N/C-terminal modifications, D- and L-amino acids, etc. In this work, we have developed a novel neural network-based approach called HemoNet for predicting the hemolytic activity of peptides. The proposed method captures the contextual importance of different amino acids in a given peptide sequence using a specialized feature embedding in conjunction with a SMILES-based fingerprint representation of N/C-terminal modifications. We have analyzed the predictive performance of the proposed method using stratified cross-validation in comparison with previous methods, non-redundant cross-validation, as well as validation on external peptides and clinical antimicrobial peptides. Our analysis shows the proposed approach achieves significantly better predictive performance (AUC-ROC of 88%) in comparison to previous approaches (HemoPI and HemoPred with AUC-ROC of 73%). HemoNet can be a useful tool in the search for novel therapeutic peptides. The Python implementation of the proposed method is available at the URL: https://github.com/adibayaseen/HemoNet.

6 citations


Journal ArticleDOI
TL;DR: A central problem of systems biology is the reconstruction of Gene Regulatory Networks (GRNs) by the use of time series data as discussed by the authors, and many attempts have been made to design an efficient method for this task.
Abstract: A central problem of systems biology is the reconstruction of Gene Regulatory Networks (GRNs) by the use of time series data. Although many attempts have been made to design an efficient method for...

5 citations


Journal ArticleDOI
TL;DR: A novel CNN model called Drug Convolutional Neural Network (DCNN) is proposed to predict ADRs from the chemical structures of drugs; in a case study, the model predicted ADRs for COVID-19 recommended drugs that align well with observations made by medical professionals using conventional methods.
Abstract: Prediction of Adverse Drug Reactions (ADRs) has been an important aspect of pharmacovigilance because of its impact on the pharma industry. The standard process of introducing a new drug to market involves many clinical trials and tests; this is a tedious and time-consuming process and also requires considerable monetary resources. Faster approval of a drug helps the patients who need it, and in silico prediction of ADRs can help speed up the process. The challenges involved are the lack of negative data and the difficulty of predicting ADRs from the chemical structure alone. Although many models are already available to predict ADRs, most use biological activity identifiers and chemical and physical properties in addition to the chemical structures of the drugs. But for most new drugs to be tested, only chemical structures will be available, and the performance of existing models that predict ADRs from chemical structures alone is not efficient. Therefore, an efficient method for predicting ADRs from just the chemical structure is proposed in this paper. The proposed approach involves a separate model for each ADR, making it a binary classification problem. This paper presents a novel CNN model called Drug Convolutional Neural Network (DCNN) to predict ADRs using the chemical structures of drugs. The performance is measured using metrics such as Accuracy, Recall, Precision, Specificity, F1 score, AUROC, and MCC. The results obtained by the proposed DCNN model outperform the competing models on the SIDER4.1 database in terms of all the metrics. A case study has been performed on COVID-19 recommended drugs, where the proposed model predicted ADRs that are well aligned with the observations made by medical professionals using conventional methods.
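The one-model-per-ADR design above reduces to training independent binary classifiers over a multi-label matrix. The sketch below is a hedged illustration of that setup with random placeholder fingerprints and logistic regression as a stand-in classifier (the paper itself uses a CNN on chemical structures and SIDER data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 64)).astype(float)  # placeholder structural fingerprints
Y = rng.integers(0, 2, size=(200, 3))                 # 3 example ADR labels (random)

models = []
for j in range(Y.shape[1]):                # one independent binary problem per ADR
    clf = LogisticRegression(max_iter=1000).fit(X, Y[:, j])
    models.append(clf)

# One predicted probability per drug per ADR.
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
print(probs.shape)  # (200, 3)
```

Swapping the stand-in classifier for a CNN over structure-derived inputs recovers the paper's actual setup without changing this outer loop.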

5 citations


Journal ArticleDOI
TL;DR: It is shown that clustering FASTA data sets of short reads into similar sub-groups for a group-by-group compression can greatly improve the compression performance.
Abstract: FASTA data sets of short reads are usually generated in tens or hundreds for a biomedical study. However, current compression of these data sets is carried out one-by-one without consideration of t...

5 citations


Journal ArticleDOI
TL;DR: A front-end method based on random forest proximity distance is used to screen the test set to improve protein-protein interaction site (PPIS) prediction and the distance information provided by PD can be used to indicate the reliability of prediction results.
Abstract: A front-end method based on random forest proximity distance (PD) is used to screen the test set to improve protein-protein interaction site (PPIS) prediction. The assessment of a distance metric is done under the assumption that a distance definition of higher quality leads to higher classification accuracy. On an independent test set, numerical analysis based on statistical inference shows that the PD has an advantage over the Mahalanobis and Cosine distances. Based on the fact that the proximity distance depends on the tree composition of the random forest model, an iterative method is designed to optimize the proximity distance, adjusting the tree composition of the random forest model by adjusting the size of the training set. Two PD metrics, 75PD and 50PD, are obtained by the iterative method. On two independent test sets, compared with the PD produced by the original training set, 75PD achieved higher Matthews correlation coefficient and F1 score values, and the differences were statistically significant. All numerical experiments show that the closer the test data are to the training data, the better the predictor's results. These findings indicate that the iterative method can optimize the proximity distance definition and that the distance information provided by PD can be used to indicate the reliability of prediction results.
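Random-forest proximity, the metric the screening above is built on, is the fraction of trees in which two samples fall in the same leaf; the corresponding distance is one minus the proximity. A small sketch on synthetic data (not PPIS features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
leaves = rf.apply(X)                         # (n_samples, n_trees) leaf indices
# Proximity = fraction of trees in which samples i and j share a leaf.
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
distance = 1.0 - proximity
print(distance[0, 0])  # 0.0: a sample shares every leaf with itself
```

Test samples whose distance to the training set exceeds a threshold would then be screened out (or their predictions flagged as less reliable), which is the role of the front-end described above.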

5 citations


Journal ArticleDOI
TL;DR: In this article, the occurrence structures of nitrogenous bases and amino acids are studied; a structural metric that measures structural differences between bases or amino acids is devised, and average distance matrices are computed for each.
Abstract: The COVID-19 pandemic has caused a global health crisis. Developing vaccines requires good knowledge of the genetic properties of SARS-CoV-2. The most fundamental approach is to look into the structures of its RNA, in particular the nucleotides and amino acids, which motivates our research on this topic. We study the occurrence structures of nitrogenous bases and amino acids. To this aim, we devise a structural metric that measures the structural differences between bases or amino acids. By analyzing various SARS-CoV-2 samples, we calculate the distance matrices for nitrogenous bases and amino acids and, from these, the average distance matrices for each. We then identify the relations of all the minimal distances between bases and amino acids. The results also show that different substructures yield much more diversified distances between amino acids. Finally, we compare our structural metric with other frequently used metrics, in particular the Hausdorff metric.
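Since the abstract benchmarks its structural metric against the Hausdorff metric, a compact reference implementation may be useful. The NumPy sketch below computes the symmetric Hausdorff distance between two toy point sets; it is illustrative only, not the paper's metric.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point sets A and B (rows = points)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    # Largest distance from any point in one set to its nearest point in the other.
    return max(d.min(axis=1).max(), d.min(axis=0).max())

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
print(hausdorff(A, B))  # 1.0
```

The same two-sided max-of-min structure applies whatever the underlying representation of the sequences being compared.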

4 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a computational method for detecting N6-methyladenine (6mA) sites, since detection by biochemical experiments, although accurate, is laborious and expensive.
Abstract: Accurate detection of N6-methyladenine (6mA) sites by biochemical experiments will help to reveal their biological functions, still, these wet experiments are laborious and expensive. Therefore, it...

4 citations


Journal ArticleDOI
TL;DR: Experimental results have shown that the proposed method can be used as a reliable computational model to reveal potential relationships between miRNAs and diseases.
Abstract: MicroRNAs (miRNA) are a type of non-coding RNA molecules that are effective on the formation and the progression of many different diseases. Various researches have reported that miRNAs play a majo...

Journal ArticleDOI
TL;DR: In this article, a sequence-based machine learning framework for the prediction of pore-forming toxins was developed, using distributed representations of the protein sequence encoded by reduced alphabet schemes based on conformational similarity and the hydropathy index as input features to Support Vector Machines (SVMs).
Abstract: Bacterial virulence can be attributed to a wide variety of factors including toxins that harm the host. Pore-forming toxins are one class of toxins that confer virulence to the bacteria and are one of the promising targets for therapeutic intervention. In this work, we develop a sequence-based machine learning framework for the prediction of pore-forming toxins. For this, we have used distributed representation of the protein sequence encoded by reduced alphabet schemes based on conformational similarity and hydropathy index as input features to Support Vector Machines (SVMs). The choice of conformational similarity and hydropathy indices is based on the functional mechanism of pore-forming toxins. Our methodology achieves about 81% accuracy indicating that conformational similarity, an indicator of the flexibility of amino acids, along with hydrophobic index can capture the intrinsic features of pore-forming toxins that distinguish it from other types of transporter proteins. Increased understanding of the mechanisms of pore-forming toxins can further contribute to the use of such "mechanism-informed" features that may increase the prediction accuracy further.

Journal ArticleDOI
TL;DR: In this article, a data-driven, deep learning-based imputation and inference framework (DIIF) is proposed to quantify the personalized and race-specific causal effects.
Abstract: Prostate Specific Antigen (PSA) level in the serum is one of the most widely used markers in monitoring prostate cancer (PCa) progression, treatment response, and disease relapse. Although significant efforts have been taken to analyze various socioeconomic and cultural factors that contribute to the racial disparities in PCa, limited research has been performed to quantitatively understand how and to what extent molecular alterations may impact differential PSA levels present at varied tumor status between African-American and European-American men. Moreover, missing values among patients add another layer of difficulty in precisely inferring their outcomes. In light of these issues, we propose a data-driven, deep learning-based imputation and inference framework (DIIF). DIIF seamlessly encapsulates two modules: an imputation module driven by a regularized deep autoencoder for imputing critical missing information and an inference module in which two deep variational autoencoders are coupled with a graphical inference model to quantify the personalized and race-specific causal effects. Large-scale empirical studies on the independent sub-cohorts of The Cancer Genome Atlas (TCGA) PCa patients demonstrate the effectiveness of DIIF. We further found that somatic mutations in TP53, ATM, PTEN, FOXA1, and PIK3CA are statistically significant genomic factors that may explain the racial disparities in different PCa features characterized by PSA.

Journal ArticleDOI
TL;DR: The Cox proportional hazards model has been widely used in cancer genomic research that aims to identify genes from high-dimensional gene expression space associated with the survival time of patie....
Abstract: The Cox proportional hazards model has been widely used in cancer genomic research that aims to identify genes from high-dimensional gene expression space associated with the survival time of patie...

Journal ArticleDOI
TL;DR: Using inverse engineering and a multi-objective optimization procedure that fits more than one experimental growth curve simultaneously, this study determined the best parameter values of a model, permitting construction of a Pareto front with 50 individuals or phenotypes.
Abstract: Several mathematical models have been developed to understand the interactions of microorganisms in foods and predict their growth. The resulting model equations for the growth of interacting cells include several parameters that must be determined for the specific conditions to be modeled. In this study, these parameters were determined using inverse engineering and a multi-objective optimization procedure that allows fitting more than one experimental growth curve simultaneously. A genetic algorithm was applied to obtain the best parameter values of a model, permitting construction of the Pareto front with 50 individuals or phenotypes. The method was applied to three experimental data sets of simultaneous growth of lactic acid bacteria (LAB) and Listeria monocytogenes (LM). The proposed method was then compared with a conventional mono-objective sequential fit. We conclude that the multi-objective fit by the genetic algorithm gives superior results, with more parameter identifiability, than the conventional sequential approach.
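The core operation behind maintaining a Pareto front of phenotypes is non-dominated filtering. The sketch below, with made-up objective values standing in for fit errors on two growth curves, keeps exactly the rows no other row dominates (both objectives minimized).

```python
import numpy as np

def pareto_front(costs):
    """Boolean mask of non-dominated rows; all objectives are minimized."""
    n = costs.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Row j dominates row i if it is no worse everywhere and strictly better somewhere.
        dominated = (costs <= costs[i]).all(axis=1) & (costs < costs[i]).any(axis=1)
        if dominated.any():
            mask[i] = False
    return mask

costs = np.array([[1.0, 4.0], [2.0, 2.0], [4.0, 1.0], [3.0, 3.0]])
print(pareto_front(costs))  # [ True  True  True False]
```

In the GA described above, this filter would be applied each generation so that the surviving population approximates the trade-off surface between the two curve-fitting errors.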

Journal ArticleDOI
TL;DR: In this article, an alternating projected gradient approach, gradient KCCA (gradKCCA), is adopted to solve kernel canonical correlation with an additional constraint that projection directions have pre-images in the original data space; a sparsity-inducing variant of the model is achieved by controlling the [Formula: see text]-norm of the pre-images of the projection directions.
Abstract: Neuroimaging genetics has become an important research topic since it can reveal complex associations between genetic variants (i.e. single nucleotide polymorphisms (SNPs)) and the structures or functions of the human brain. However, existing kernel mappings make it difficult to use sparse representation methods directly in the kernel feature space, so most existing sparse canonical correlation analysis (SCCA) methods cannot be directly extended to it. To bridge this gap, we adopt a novel alternating projected gradient approach, the gradient KCCA (gradKCCA) model, to develop a powerful model for exploring the intrinsic associations between genetic markers and imaging quantitative traits (QTs) of interest. Specifically, this model solves kernel canonical correlation analysis (KCCA) with an additional constraint that projection directions have pre-images in the original data space; a sparsity-inducing variant of the model is achieved by controlling the [Formula: see text]-norm of the pre-images of the projection directions. We evaluate this model using the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort to discover relationships between SNPs from the Alzheimer's disease (AD) risk gene APOE and imaging QTs extracted from structural magnetic resonance imaging (MRI) scans. Our results show that the algorithm not only outperforms the traditional KCCA method in terms of Root Mean Square Error (RMSE) and Correlation Coefficient (CC) but also identifies meaningful and relevant SNP biomarkers (e.g. rs157594 and rs405697), which are positively related to the right Postcentral and right SupraMarginal brain regions in this study. Empirical results indicate its promising capability in revealing biologically meaningful neuroimaging genetics associations and improving the disease-related mechanistic understanding of AD.
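gradKCCA itself is beyond a short snippet, but the quantity it optimizes reduces, in the linear case, to canonical correlation. As a simplified linear analogue (not the paper's kernel method), the sketch below recovers the first canonical correlation via whitening and an SVD, on synthetic data with one shared latent signal.

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-6):
    """First canonical correlation between column-centered blocks X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    # Whiten both blocks; the top singular value of the whitened
    # cross-covariance is the first canonical correlation.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    return np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)[0]

rng = np.random.default_rng(3)
z = rng.normal(size=(500, 1))                    # shared latent signal
X = np.hstack([z, rng.normal(size=(500, 2))])    # synthetic SNP-like block
Y = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
print(cca_first_pair(X, Y))  # close to 1: the shared latent signal is recovered
```

The kernel variant replaces the linear cross-covariance with kernel matrices; gradKCCA's pre-image constraint and [Formula: see text]-norm control are what make the resulting directions sparse and interpretable.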

Journal ArticleDOI
TL;DR: In this paper, a novel variant of the artificial bee colony optimization algorithm is proposed to improve the exploitation process when digging out common structural patterns of DNA-binding proteins, and the motif locations obtained using the derived common pattern are compared with the results of two other motif detection tools.
Abstract: In this work, we have developed an optimization framework for digging out common structural patterns inherent in DNA-binding proteins. A novel variant of the artificial bee colony optimization algorithm is proposed to improve the exploitation process. Experiments on four benchmark objective functions at different dimensions demonstrated the speedier convergence of the algorithm. It also generated optimum features of the Helix-Turn-Helix structural pattern based on an objective function defined by occurrence counts on secondary structure. The proposed algorithm outperformed the compared methods in convergence speed and the quality of generated motif features. The motif locations obtained using the derived common pattern were compared with the results of two other motif detection tools; 92% of tested proteins produced locations matching the results of the compared methods. The performance of the approach was analyzed with various measures, showing higher sensitivity, specificity, and area-under-the-curve values. A novel strategy for druggability discovery through docking studies targeting the motif locations is also discussed.

Journal ArticleDOI
TL;DR: In this article, the performance of LeDock and three standalone scoring functions was tested on 195 high-quality protein-ligand complexes; the freely available docking program LeDock achieved a success rate of 89.20% for the best pose, indicative of strong sampling power.
Abstract: Molecular docking is a fast and efficient computational method for predicting the binding mode and binding affinity between a ligand and a target protein at the atomic level. However, the performance of current docking programs is less than satisfactory. Herein, with a focus on free programs and scoring functions, the performances of LeDock and three standalone scoring functions were tested on 195 high-quality protein-ligand complexes. Results showed that the freely available docking program LeDock achieved a success rate of 89.20% for the best pose, indicative of strong sampling power. Based on the poses generated by LeDock, a comparative evaluation of three other non-commercial scoring functions, DSX (DrugScore X), PoseScore, and X-score, was performed. Among the evaluated scoring functions, DSX and X-score exhibited the best scoring power and ranking power, respectively. The performances of LeDock, DSX, and X-score were similar in the docking power test and much better than PoseScore. Accordingly, it is suggested that combining pose sampling by LeDock with rescoring by DSX or X-score could improve the prediction accuracy of molecular docking and be applied in lead discovery.

Journal ArticleDOI
TL;DR: A model called Sparse Robust Graph-regularized Non-negative Matrix Factorization based on Correntropy (SGNMFC) is proposed, in which maximized correntropy replaces the traditional minimized Euclidean distance to improve the robustness of the algorithm.
Abstract: Non-negative Matrix Factorization (NMF) has become a popular data dimension reduction method in recent years, but the traditional NMF method is highly sensitive to data noise. In this paper, we propose a model called Sparse Robust Graph-regularized Non-negative Matrix Factorization based on Correntropy (SGNMFC). Maximized correntropy replaces the traditional minimized Euclidean distance to improve the robustness of the algorithm: through its kernel function, correntropy gives less weight to outliers and noise in the data but greater weight to meaningful data. Meanwhile, the geometric structure of the high-dimensional data is preserved in the low-dimensional manifold through graph regularization. Feature selection and sample clustering are commonly used methods for analyzing genes, and sparse constraints are applied to the loss function to reduce matrix complexity and analysis difficulty. Compared with five other similar methods, the effectiveness of the SGNMFC model is demonstrated by differentially expressed gene selection and sample clustering experiments on three The Cancer Genome Atlas (TCGA) datasets.
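The robustness mechanism above comes from correntropy's Gaussian kernel of the residual, which induces per-entry weights that vanish for outliers. A minimal illustration of those induced weights follows; the full SGNMFC factorization updates are not reproduced here.

```python
import numpy as np

def correntropy_weights(residuals, sigma=1.0):
    """Gaussian-kernel weights: near 1 for small residuals, near 0 for outliers."""
    return np.exp(-residuals**2 / (2.0 * sigma**2))

residuals = np.array([0.0, 0.5, 1.0, 5.0])   # last entry mimics a gross outlier
w = correntropy_weights(residuals)
print(np.round(w, 3))  # the outlier's weight is effectively zero
```

Under a squared-Euclidean loss all four residuals would contribute in proportion to their square, so the outlier would dominate; the correntropy weighting is what suppresses it.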

Journal ArticleDOI
TL;DR: In this article, it is observed that splice junctions lacking the consensus sequences are not canonical splice junctions and are therefore missed by current prediction models built around canonical junctions.
Abstract: Most of the current computational models for splice junction prediction are based on the identification of canonical splice junctions. However, it is observed that the junctions lacking the consens...

Journal ArticleDOI
TL;DR: A new approach (called FastMtQTL) for multi-trait QTL analysis is introduced, based on the assumption of a multivariate normal distribution of phenotypic observations, that identifies almost the same QTL positions as those identified by existing methods.
Abstract: Multivariate simple interval mapping (SIM) is one of the most popular approaches for multiple quantitative trait locus (QTL) analysis. Both maximum likelihood (ML) and least squares (LS) multivariate regression (MVR) are widely used methods for multi-trait SIM. ML-based MVR (MVR-ML) is an expectation maximization (EM) algorithm-based, iterative, and complex time-consuming approach. Although the LS-based MVR (MVR-LS) approach is not iterative, the calculation of the likelihood ratio (LR) statistic in MVR-LS is also a time-consuming, complex process. We introduce a new approach (called FastMtQTL) for multi-trait QTL analysis based on the assumption of a multivariate normal distribution of phenotypic observations. Our proposed method identifies almost the same QTL positions as those identified by the existing methods. Moreover, it requires comparatively less computation time because of the simplicity of its LR statistic, which is calculated using only the sample variance-covariance matrix of phenotypes and the conditional probability of QTL genotypes given the marker genotypes. This improvement in computation time is advantageous when the numbers of phenotypes and individuals are large and the markers are very dense, resulting in QTL mapping on bigger datasets.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a methodology to detect deregulation mechanisms with a particular focus on cancer subtypes based on the comparison between tumoral and healthy cells, and then they measured the ability of each transcription factor to explain these deregulations.
Abstract: In many cancers, mechanisms of gene regulation can be severely altered. Identification of deregulated genes, which do not follow the regulation processes that exist between transcription factors and their target genes, is of importance to better understand the development of the disease. We propose a methodology to detect deregulation mechanisms with a particular focus on cancer subtypes. This strategy is based on the comparison between tumoral and healthy cells. First, we use gene expression data from healthy cells to infer a reference gene regulatory network. Then, we compare it with gene expression levels in tumor samples to detect deregulated target genes. We finally measure the ability of each transcription factor to explain these deregulations. We apply our method on a public bladder cancer data set derived from The Cancer Genome Atlas project and confirm that it captures hallmarks of cancer subtypes. We also show that it enables the discovery of new potential biomarkers.
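The detection step described above reduces to: fit a regulatory model on healthy samples, then flag tumor samples whose target expression deviates from its prediction. Below is a hedged linear-regression sketch with synthetic expression values, a deliberate simplification of the paper's network-based approach (one transcription factor, one target).

```python
import numpy as np

rng = np.random.default_rng(4)
tf_healthy = rng.normal(size=(100, 1))          # transcription factor expression
target_healthy = 2.0 * tf_healthy[:, 0] + 0.1 * rng.normal(size=100)

# Reference regulatory model inferred from healthy cells (least squares).
slope = np.linalg.lstsq(tf_healthy, target_healthy, rcond=None)[0][0]

tf_tumor = rng.normal(size=(20, 1))
target_tumor = 2.0 * tf_tumor[:, 0] + 0.1 * rng.normal(size=20)
target_tumor[:5] += 4.0                          # simulate five deregulated samples

# A tumor sample is flagged when its target departs strongly from the
# healthy-cell prediction.
residual = np.abs(target_tumor - slope * tf_tumor[:, 0])
deregulated = residual > 3 * residual[5:].std()
print(deregulated[:5])  # the five shifted samples are flagged
```

The real method infers a full regulatory network from healthy expression and scores each transcription factor by how well it explains the observed deregulations, but the residual-against-reference logic is the same.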

Journal ArticleDOI
TL;DR: The authors explore the sequence-based prediction of change in protein binding affinity upon mutation and question the effectiveness of cross-validation (CV) across mutations, as adopted in previous studies, for assessing the generalization ability of such predictors to mutations unseen during training.
Abstract: Accurately determining the change in protein binding affinity upon mutation is important for finding novel therapeutics and assisting mutagenesis studies. Determining the change in binding affinity upon mutation requires sophisticated, expensive, and time-consuming wet-lab experiments that can be supported with computational methods. Most of the available computational prediction techniques depend upon protein structures, which limits their applicability to protein complexes with known 3D structures. In this work, we explore the sequence-based prediction of change in protein binding affinity upon mutation and question the effectiveness of [Formula: see text]-fold cross-validation (CV) across mutations, adopted in previous studies, for assessing the generalization ability of such predictors to mutations unseen during training. We have used protein sequence information instead of protein structures, along with machine learning techniques, to accurately predict the change in protein binding affinity upon mutation. Our proposed sequence-based predictor of change in protein binding affinity, called PANDA, performs comparably to existing methods when gauged through an appropriate CV scheme and an external independent test dataset. On the external test dataset, our method gives a maximum Pearson correlation coefficient of 0.52, compared with 0.59 for the state-of-the-art protein structure-based method MutaBind. Our protein sequence-based method for predicting the change in binding affinity upon mutation has wide applicability and comparable performance to existing protein structure-based methods. We made PANDA easily accessible through a cloud-based webserver and Python code, available at https://sites.google.com/view/wajidarshad/software and https://github.com/wajidarshad/panda, respectively.
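The CV concern raised above is that random fold splitting can place records of the same mutation (or complex) in both training and test folds, inflating scores. A group-aware split avoids this; the sketch below uses scikit-learn's GroupKFold with toy group labels standing in for mutation identifiers.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1).astype(float)    # placeholder feature rows
groups = np.repeat(["M1", "M2", "M3", "M4"], 3)   # one id per mutation/complex

folds = list(GroupKFold(n_splits=4).split(X, groups=groups))
for train_idx, test_idx in folds:
    # No group appears on both sides of any split.
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
print(len(folds))  # 4
```

Scoring a predictor under this scheme, rather than plain k-fold, is what reveals how it generalizes to mutations never seen during training.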

Journal ArticleDOI
TL;DR: This article addresses the cost of high-throughput chromosome conformation capture (Hi-C), one of the most popular methods for studying the three-dimensional organization of genomes, whose protocols can be expensive.
Abstract: High-throughput chromosome conformation capture (Hi-C) is one of the most popular methods for studying the three-dimensional organization of genomes. However, Hi-C protocols can be expensive since ...

Journal ArticleDOI
TL;DR: The de Bruijn Graph algorithm (DBG), one of the cornerstone algorithms in short-read assembly, has been extended with the rapid advancement of Next Generation Sequencing (NGS) technologies and their low cost.
Abstract: The de Bruijn Graph algorithm (DBG), one of the cornerstone algorithms in short-read assembly, has been extended with the rapid advancement of Next Generation Sequencing (NGS) technologies and low...

Journal ArticleDOI
TL;DR: In this paper, the authors propose using a characteristic period in place of the characteristic frequency traditionally used by RRM-based methods; the resulting method can be readily used for any protein sequence provided its interface residues and protein family are known.
Abstract: Specific functions in biological processes depend on protein-protein interactions. Hot spot residues play a key role in determining these interactions and have wide applications in protein engineering and drug discovery. Experimental techniques to identify hot spots are often labor-intensive and expensive, and most of the computational methods developed so far are structure-based and require training. In this work, hot spots are identified from sequence information alone using the Resonant Recognition Model (RRM). The proposed method uses a characteristic period in place of the characteristic frequency traditionally used by RRM-based methods. The characteristic period is extracted from the consensus spectrum of protein families using the Ramanujan Fourier Transform (RFT). Position-period plots for proteins are generated using the Short-Time RFT (ST-RFT) with a Gaussian window, and hot spots are identified by thresholding the signal corresponding to the protein's characteristic period in the ST-RFT. To enhance the performance of the ST-RFT, the Gaussian window shape parameter is optimized using concentration measure as a metric. Better sensitivity has been observed for this method compared with other reported RRM-based methods. Since the method is model-independent, it does not require any training and can be readily used for any protein sequence provided its interface residues and protein family are known.

Journal ArticleDOI
TL;DR: In this paper, the authors study the limitations imposed on the transcription process by the presence of short ubiquitous pauses and crowding, and demonstrate that a functional relationship among the model parameters can be estimated using a standard statistical analysis, and this functional relationship describes the various trade-offs that must be made in order for the gene to control the elongation process and achieve a desired average transcription time.
Abstract: In this paper, we study the limitations imposed on the transcription process by the presence of short ubiquitous pauses and crowding. These effects are especially pronounced in highly transcribed genes such as ribosomal genes (rrn) in fast growing bacteria. Our model indicates that the quantity and duration of pauses reported for protein-coding genes is incompatible with the average elongation rate observed in rrn genes. When maximal elongation rate is high, pause-induced traffic jams occur, increasing promoter occlusion, thereby lowering the initiation rate. This lowers average transcription rate and increases average transcription time. Increasing maximal elongation rate in the model is insufficient to match the experimentally observed average elongation rate in rrn genes. This suggests that there may be rrn-specific modifications to RNAP, which then experience fewer pauses, or pauses of shorter duration than those in protein-coding genes. We identify model parameter triples (maximal elongation rate, mean pause duration time, number of pauses) which are compatible with experimentally observed elongation rates. Average transcription time and average transcription rate are the model outputs investigated as proxies for cell fitness. These fitness functions are optimized for different parameter choices, opening up a possibility of differential control of these aspects of the elongation process, with potential evolutionary consequences. As an example, a gene's average transcription time may be crucial to fitness when the surrounding medium is prone to abrupt changes. This paper demonstrates that a functional relationship among the model parameters can be estimated using a standard statistical analysis, and this functional relationship describes the various trade-offs that must be made in order for the gene to control the elongation process and achieve a desired average transcription time. 
It also demonstrates the robustness of the system when a range of maximal elongation rates can be balanced with transcriptional pause data in order to maintain a desired fitness.
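As a toy illustration of how pause count and duration trade off against the free elongation rate, the single-polymerase Monte Carlo below draws exponential pause durations and adds them to the pause-free transit time. It deliberately ignores the traffic jams and promoter occlusion that are central to the paper's model, and the gene length, rate, and pause parameters are illustrative, not the paper's fitted values.

```python
import random

def transcription_time(length, elong_rate, n_pauses, mean_pause, rng):
    """One transcription event: free elongation time plus pause delays.

    With a single polymerase there are no jams, so pause positions do not
    matter and only the number and duration of pauses contribute.
    """
    t = length / elong_rate                       # pause-free transit time
    t += sum(rng.expovariate(1.0 / mean_pause)    # exponential pause draws
             for _ in range(n_pauses))
    return t

rng = random.Random(0)
times = [transcription_time(5000, 80.0, 10, 1.5, rng) for _ in range(2000)]
avg = sum(times) / len(times)
# Expected mean is length/rate + n_pauses*mean_pause = 62.5 + 15 = 77.5 s.
```

Sweeping (elong_rate, mean_pause, n_pauses) over a grid of such runs reproduces, in miniature, the kind of parameter-triple compatibility analysis the paper performs.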

Journal ArticleDOI
TL;DR: A step-by-step algorithm for analyzing the affinity of protein interactions and an analysis of energy interactions between the active center of a protein and the wild-type peptide interacting with it, taking into account modifications of the latter are provided.
Abstract: This paper develops and describes a detailed method for selecting inhibitors based on modified natural peptides for the SARS-CoV BJ01 spike glycoprotein. The selection of inhibitors is carried out by increasing the affinity of the peptide to the active center of the protein. The paper also provides a step-by-step algorithm for analyzing the affinity of protein interactions and presents an analysis of energy interactions between the active center of a protein and the wild-type peptide interacting with it, taking into account modifications of the latter. A description of the software package that implements the presented algorithm is given on the website https://binomlabs.com/covid19.

Journal ArticleDOI
TL;DR: Testing this algorithm on raw sequences consisting of both partial and complete nucleotide sequences of various bacteria has yielded good results in predicting the loci of prophages in them, suggesting that a data-centric approach can yield comparable results while using a fraction of the resources.
Abstract: This paper proposes a new algorithm for prophage loci prediction in bacteria. Prophages are defined in bioinformatics as viral nucleotide sequences found intermixed with host nucleotide sequences in bacteria. The proposed algorithm uses machine learning patterns and processing methodologies to provide a highly efficient system for loci prediction, thereby reducing the time-space complexity required compared with others of its class. In the training phase, a pattern database is constructed from raw nucleotide sequences of both bacteria and viruses obtained from a training set. In the prediction phase, the aforementioned database is used along with Particle Swarm Optimization (PSO) to predict the probable loci of prophages in a test set of bacterial nucleotide sequences. Testing this method on raw sequences consisting of both partial and complete nucleotide sequences of various bacteria has yielded good results in predicting the loci of prophages. This algorithm and connected processes compare favorably in terms of predictive performance with others of its class such as PhiSpy and ProphET, while outperforming them in terms of raw processing speed, suggesting that a data-centric approach can yield comparable results while using a fraction of the resources.
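The prediction phase relies on Particle Swarm Optimization; the minimal 1-D PSO sketch below is not the authors' implementation (its inertia and attraction coefficients are generic textbook values, and the quadratic objective is a stand-in for a locus-scoring function), but it shows the velocity update that pulls each particle toward its personal best and the swarm-wide best.

```python
import random

def pso_minimize(score, lo, hi, n_particles=20, iters=60, seed=0):
    """Minimal 1-D particle swarm minimization of score over [lo, hi]."""
    rng = random.Random(seed)
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                      # each particle's best position so far
    gbest = min(pos, key=score)         # swarm-wide best position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Inertia term plus pulls toward personal and global bests.
            vel[i] = (0.7 * vel[i]
                      + 1.5 * r1 * (pbest[i] - pos[i])
                      + 1.5 * r2 * (gbest - pos[i]))
            pos[i] = min(max(pos[i] + vel[i], lo), hi)
            if score(pos[i]) < score(pbest[i]):
                pbest[i] = pos[i]
        gbest = min(pbest, key=score)
    return gbest

# Toy objective: best "locus" at position 42 on a 0..100 coordinate axis.
best = pso_minimize(lambda x: (x - 42.0) ** 2, 0.0, 100.0)
```

In the prophage setting the scalar position would be a candidate genomic coordinate and the score would come from the pattern database, with the same update rule.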

Journal ArticleDOI
TL;DR: A scheme is proposed to monitor the response of cells after being infected by viruses, and this scheme can be extended straightforwardly to extract characteristics of trajectories of complex systems.
Abstract: Viral infection is a complicated dynamic process, in which viruses intrude into cells to duplicate themselves and trigger succeeding biological processes regulated by genes. It can pose a serious threat to human health. A scheme is proposed to monitor the response of cells after being infected by viruses. Co-expression levels of genes measured at successive time points form a gene expression profile sequence, which is mapped to a temporal gene regulatory network. The fission and fusion of the communities of the networks are used to find the active parts. We investigated an experiment injecting flu viruses into a total of 17 healthy volunteers, who split into an infected group and a survival group. The survival group is much more chaotic, i.e. complicated fissions and fusions of communities occur over the whole network. For the infected group, the most active part of the regulatory network forms a single community, but it is included in one of the large communities and completely conserved in the survival group. A total of six and seven genes in the active structure take part in the Parkinson's disease and ribosome pathways, respectively. In fact, a total of 30 of the genes in the active structure (covering [Formula: see text]) participate in neurodegeneration and its related pathways. This scheme can be extended straightforwardly to extract characteristics of trajectories of complex systems.
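A toy sketch of the first step of such a scheme: build a co-expression network at one time window by thresholding pairwise Pearson correlations, then take connected components as a crude stand-in for the communities whose fission and fusion the paper tracks. The threshold and the four-gene data are illustrative, and real community detection (e.g. modularity-based) would replace the component step.

```python
from itertools import combinations

def corr(x, y):
    """Pearson correlation of two equal-length expression series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def components(genes, edges):
    """Connected components via union-find (stand-in for communities)."""
    parent = {g: g for g in genes}
    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g
    for a, b in edges:
        parent[find(a)] = find(b)
    comps = {}
    for g in genes:
        comps.setdefault(find(g), set()).add(g)
    return list(comps.values())

def coexpression_components(expr, threshold=0.8):
    """expr: gene -> expression values across time points in one window."""
    edges = [(a, b) for a, b in combinations(expr, 2)
             if abs(corr(expr[a], expr[b])) >= threshold]
    return components(list(expr), edges)

# g1, g2, g3 are mutually (anti)correlated; g4 stays on its own.
expr = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8],
        "g3": [4, 3, 2, 1], "g4": [5, 1, 7, 2]}
comps = coexpression_components(expr)
```

Repeating this per time window and comparing consecutive component sets (e.g. by Jaccard overlap) gives the fission/fusion events the scheme monitors.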