
Showing papers in "Journal of Bioinformatics and Computational Biology in 2017"


Journal ArticleDOI
TL;DR: The ability of t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.
Abstract: t-distributed stochastic neighbor embedding (t-SNE) is a new dimension reduction and visualization technique for high-dimensional data. t-SNE is rarely applied to human genetic data, even though it is commonly used in other data-intensive biological fields, such as single-cell genomics. We explore the applicability of t-SNE to human genetic data and make these observations: (i) like previously used dimension reduction techniques such as principal component analysis (PCA), t-SNE is able to separate samples from different continents; (ii) unlike PCA, t-SNE is more robust to the presence of outliers; (iii) t-SNE is able to display both continental and sub-continental patterns in a single plot. We conclude that the ability of t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.

146 citations
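A core step of t-SNE hinted at in the abstract is converting pairwise distances into conditional probabilities whose perplexity matches a user-chosen value. Below is a minimal stdlib sketch of that per-point bandwidth calibration; the distances and target perplexity are illustrative, not from the paper.

```python
import math

def perplexity(probs):
    # Perplexity = 2^H(P), with Shannon entropy H in bits.
    h = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** h

def calibrate_sigma(sq_dists, target_perplexity, tol=1e-5):
    """Binary-search the Gaussian bandwidth sigma for one point so that the
    perplexity of its conditional neighbor distribution matches the target."""
    lo, hi = 1e-10, 1e10
    for _ in range(200):
        sigma = (lo + hi) / 2.0
        weights = [math.exp(-d / (2.0 * sigma * sigma)) for d in sq_dists]
        total = sum(weights)
        probs = [w / total for w in weights]
        p = perplexity(probs)
        if abs(p - target_perplexity) < tol:
            break
        if p > target_perplexity:   # distribution too flat: shrink sigma
            hi = sigma
        else:                       # too peaked: widen sigma
            lo = sigma
    return sigma, probs

# Squared distances from one sample to its 4 nearest neighbors (made up).
sigma, probs = calibrate_sigma([1.0, 2.0, 4.0, 8.0], target_perplexity=3.0)
```

A full t-SNE implementation repeats this for every point and then optimizes the low-dimensional embedding by gradient descent on the KL divergence.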


Journal ArticleDOI
TL;DR: This work supports the hypothesis that one possible mechanism for avoiding chaos in gene networks is negative evolutionary selection, which prevents the fixation or realization of regulatory circuits that create conditions for the emergence of chaos which are, biologically speaking, too easily met.
Abstract: Today there are examples that prove the existence of chaotic dynamics at all levels of organization of living systems except the intracellular one, although such a possibility has been theoretically predicted. The lack of experimental evidence of chaos generation at the intracellular level in vivo may indicate that during evolution the cell got rid of chaos. This work supports the hypothesis that one possible mechanism for avoiding chaos in gene networks is negative evolutionary selection, which prevents the fixation or realization of regulatory circuits that create conditions for the emergence of chaos which are, from the biological point of view, too easily met. It has been shown that one such circuit may be a combination of negative autoregulation of the expression of transcription factors at the level of their synthesis and degradation. The presence of such a circuit results in the formation of multiple branches of chaotic solutions, as well as the formation of hyperchaos, at equal and sufficiently low values of the delayed argument, which can be realized not only in eukaryotic but also in prokaryotic cells.

25 citations
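The kind of circuit the abstract describes, negative autoregulation with a delayed argument, can be illustrated with a toy delay differential equation integrated by the Euler method. This is a generic Mackey-Glass-style sketch, not the paper's actual model; all parameter values are assumptions.

```python
def simulate_delayed_autorepression(a=4.0, b=1.0, n=9, tau=2.0,
                                    dt=0.01, steps=5000, x0=0.5):
    """Euler integration of dx/dt = a / (1 + x(t - tau)**n) - b * x(t):
    synthesis repressed by the *delayed* protein level, linear degradation."""
    lag = int(tau / dt)                 # delay expressed in time steps
    history = [x0] * (lag + 1)          # constant pre-history on [-tau, 0]
    for _ in range(steps):
        x = history[-1]
        x_delayed = history[-1 - lag]
        dx = a / (1.0 + x_delayed ** n) - b * x
        history.append(x + dt * dx)
    return history

traj = simulate_delayed_autorepression()
```

Depending on the Hill coefficient and delay, such systems range from stable steady states through oscillations to chaos, which is why the delayed argument plays the central role in the abstract's argument.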


Journal ArticleDOI
TL;DR: A method is developed that encodes key features of virus and human proteins of variable length into a feature vector of fixed length, and an SVM model with gene ontology annotations of proteins is used to predict new HPV-human PPIs.
Abstract: The interaction of virus proteins with host proteins plays a key role in viral infection and consequent pathogenesis. Many computational methods have been proposed to predict protein-protein interactions (PPIs), but most of them are intended for PPIs within a species rather than PPIs across different species, such as virus-host PPIs. We developed a method that encodes key features of virus and human proteins of variable length into a feature vector of fixed length. The key features include the relative frequency of amino acid triplets (RFAT), the frequency difference of amino acid triplets (FDAT) between virus and host proteins, and amino acid composition (AC). We constructed several support vector machine (SVM) models to evaluate our method and to compare it with others on PPIs between human and two types of viruses: human papillomaviruses (HPV) and hepatitis C virus (HCV). Comparison with other methods on the same datasets of HPV-human and HCV-human PPIs showed that our method performs significantly better on all performance measures. Using the SVM model with gene ontology (GO) annotations of proteins, we predicted new HPV-human PPIs. We believe our approach will be useful in predicting heterogeneous PPIs.

24 citations
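The fixed-length encoding described above can be sketched for the amino-acid-composition (AC) and triplet-frequency parts. The paper's exact normalization, and any grouping of residues into classes before counting, may differ; this sketch counts over the 20 standard residues. The FDAT feature is then just the element-wise difference of the virus and host triplet vectors.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPLETS = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]  # 8000 dims

def triplet_frequencies(seq):
    """Relative frequency of every amino-acid triplet: a fixed-length
    vector regardless of the protein's sequence length (RFAT-style)."""
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = max(len(seq) - 2, 1)
    return [counts[t] / total for t in TRIPLETS]

def amino_acid_composition(seq):
    """20-dimensional amino-acid composition (AC) vector."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

# FDAT would be: [v - h for v, h in zip(virus_triplets, host_triplets)].
vec = triplet_frequencies("MKTAYIAKQR")   # toy sequence
```

Concatenating such vectors for the virus and host proteins yields the fixed-length input the SVM expects.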


Journal ArticleDOI
TL;DR: The visualCMAT web-server can be used to understand the relationship between structure and function in proteins, to select hotspots and compensatory mutations for rational design and directed evolution experiments aimed at producing novel enzymes with improved properties, and to study the mechanisms of selective ligand binding and allosteric communication between topologically independent sites in protein structures.
Abstract: The visualCMAT web-server was designed to assist experimental research in the fields of protein/enzyme biochemistry, protein engineering, and drug discovery by providing an intuitive and easy-to-use interface to the analysis of correlated mutations/co-evolving residues. Sequence and structural information describing homologous proteins are used to predict correlated substitutions by the mutual-information-based CMAT approach, classify them into spatially close co-evolving pairs (which either form a direct physical contact or interact with the same ligand, e.g. a substrate or a crystallographic water molecule) and long-range correlations, and annotate and rank binding sites on the protein surface by the presence of statistically significant co-evolving positions. The results of visualCMAT are organized for convenient visual analysis and can be downloaded to a local computer as a content-rich all-in-one PyMol session file with multiple layers of annotation corresponding to bioinformatic, statistical and structural analyses of the predicted co-evolution, or further studied online using the built-in interactive analysis tools. The online interactivity is implemented in HTML5, so neither plugins nor Java are required. The visualCMAT web-server is integrated with the Mustguseal web-server, which can construct large structure-guided sequence alignments of protein families and superfamilies using all available information about their structures and sequences in public databases. The visualCMAT web-server can be used to understand the relationship between structure and function in proteins, to select hotspots and compensatory mutations for rational design and directed evolution experiments aimed at producing novel enzymes with improved properties, and to study the mechanisms of selective ligand binding and allosteric communication between topologically independent sites in protein structures.
The web-server is freely available at https://biokinet.belozersky.msu.ru/visualcmat and there are no login requirements.

15 citations


Journal ArticleDOI
TL;DR: This work used more than 3200 HIV-1 RT variants from the publicly available Stanford HIV RT and protease sequence database, already tested for 10 anti-HIV drugs including both nucleoside and non-nucleoside RT inhibitors, and described primary structure-resistance relationships in terms of particular amino acid residues and their positions.
Abstract: HIV reverse transcriptase (RT) inhibitors targeting the early stages of virus-host interactions are of great interest to scientists. Acquired HIV RT resistance arises from mutations in a particular region of the pol gene encoding the HIV RT amino acid sequence. We propose an application of the previously developed PASS algorithm for prediction of amino acid substitutions potentially involved in the resistance of HIV-1, based on open data. In our work, we used more than 3200 HIV-1 RT variants from the publicly available Stanford HIV RT and protease sequence database, already tested for 10 anti-HIV drugs including both nucleoside and non-nucleoside RT inhibitors. We used a particular amino acid residue and its position to describe primary structure-resistance relationships. The average balanced accuracy of the prediction obtained in 20-fold cross-validation was about 88% for the Phenosense dataset and about 79% for the Antivirogram dataset. Thus, the PASS-based algorithm may be used for prediction of the amino acid substitutions associated with the resistance of HIV-1 based on open data. Such predictions can be useful both for selecting RT inhibitors for the treatment of HIV-infected patients in clinical practice and for the development of new anti-HIV drugs active against resistant variants of RT.

15 citations


Journal ArticleDOI
TL;DR: The MAMMOTh database entries are organized as building blocks in a way that the model parts can be used in different combinations to describe systems with higher organizational level (metabolic pathways and/or transcription regulatory networks) and supports export of a single model or their combinations in SBML or Mathematica standards.
Abstract: Motivation: Living systems have a complex hierarchical organization that can be viewed as a set of dynamically interacting subsystems. Thus, to simulate the internal nature and dynamics of an entire biological system, model reconstruction should proceed iteratively, by consistent composition and combination of its elementary subsystems. In accordance with this bottom-up approach, we have developed the MAthematical Models of bioMOlecular sysTems (MAMMOTh) tool, which consists of a database of manually curated models fitted to experimental data and a software tool that provides their further integration. Results: The MAMMOTh database entries are organized as building blocks so that the model parts can be used in different combinations to describe systems at a higher organizational level (metabolic pathways and/or transcription regulatory networks). The tool supports export of a single model or combinations of models in SBML or Mathematica formats. The database currently ...

14 citations


Journal ArticleDOI
TL;DR: A new metaheuristic namely Elephant Swarm Water Search Algorithm (ESWSA) to infer Gene Regulatory Network (GRN) is proposed, mainly based on the water search strategy of intelligent and social elephants during drought, utilizing the different types of communication techniques.
Abstract: Correct inference of the genetic regulation inside a cell from biological databases such as time-series microarray data is one of the greatest challenges of the post-genomic era for biologists and researchers. The Recurrent Neural Network (RNN) is one of the most popular and simplest approaches for modeling the dynamics and inferring correct dependencies among genes. Inspired by the behavior of social elephants, we propose a new metaheuristic, the Elephant Swarm Water Search Algorithm (ESWSA), to infer Gene Regulatory Networks (GRNs). The algorithm is mainly based on the water search strategy of intelligent and social elephants during drought, which makes use of different types of communication techniques. Initially, the algorithm is tested against benchmark small- and medium-scale artificial genetic networks, with and without different noise levels, and its efficiency is observed in terms of parametric error, minimum fitness value, execution time, accuracy of prediction of true regulations, etc. Next, the proposed algorithm is tested against real gene expression data of the Escherichia coli SOS network, and the results are compared with other state-of-the-art optimization methods. The experimental results suggest that ESWSA is very efficient for the GRN inference problem and performs better than other methods in many respects.

14 citations
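The RNN formulation of gene network dynamics mentioned above can be sketched as a discretized sigmoidal model; a metaheuristic such as ESWSA would then search the weight, bias, and decay parameters to minimize the error against observed time series. The two-gene network and all parameter values below are hypothetical.

```python
import math

def simulate_grn(weights, biases, decay, x0, dt=0.1, steps=100):
    """Discretized RNN model of a gene network:
    dx_i/dt = sigmoid(sum_j w[i][j] * x_j + b_i) - decay_i * x_i."""
    x = list(x0)
    trajectory = [list(x)]
    for _ in range(steps):
        new_x = []
        for i in range(len(x)):
            drive = sum(w * xj for w, xj in zip(weights[i], x)) + biases[i]
            sigma = 1.0 / (1.0 + math.exp(-drive))
            new_x.append(x[i] + dt * (sigma - decay[i] * x[i]))
        x = new_x
        trajectory.append(list(x))
    return trajectory

# Hypothetical two-gene loop: gene 1 activates gene 0's repressor input.
W = [[0.0, -5.0],
     [5.0,  0.0]]
traj = simulate_grn(W, biases=[2.0, -2.0], decay=[1.0, 1.0], x0=[0.1, 0.1])
```

GRN inference then reduces to an optimization problem: find the (W, b, decay) triple whose simulated trajectory best matches the measured expression time series, which is exactly where a population-based search such as ESWSA is applied.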


Journal ArticleDOI
TL;DR: This study proposes explicitly modelling the mitochondrial double membrane structures, and acquiring the image edges by way of ridge detection rather than by image gradient, and utilizes group-similarity in context to further optimize the local misleading segmentation.
Abstract: It is now possible to look more closely into mitochondrial physical structures due to the rapid development of the electron microscope (EM). Mitochondrial physical structures play important roles in both cellular physiology and neuronal functions. Unfortunately, the segmentation of mitochondria from EM images has proven to be a difficult and challenging task, due to the presence of various subcellular structures, as well as image distortions in the complex background. Although the current state-of-the-art algorithms have achieved some promising results, they have demonstrated poor performance on mitochondria that are in close proximity to vesicles or various membranes. In order to overcome these limitations, this study proposes explicitly modelling the mitochondrial double-membrane structures and acquiring the image edges by way of ridge detection rather than by image gradient. In addition, this study also utilizes group-similarity in context to further optimize locally misleading segmentations. Experimental results on images acquired by automated tape-collecting ultramicrotome scanning electron microscopy (ATUM-SEM) demonstrate the effectiveness of the proposed algorithm.

14 citations



Journal ArticleDOI
TL;DR: A text mining tool called MPTM is developed, which extracts and organizes valuable knowledge about 11 common PTMs from abstracts in PubMed by using relations extracted from dependency parse trees and a heuristic algorithm.
Abstract: Due to the importance of post-translational modifications (PTMs) in human health and diseases, PTMs are regularly reported in the biomedical literature. However, the continuing and rapid pace of expansion of this literature brings a huge challenge for researchers and database curators. Therefore, there is a pressing need to aid them in identifying relevant PTM information more efficiently by using a text mining system. So far, only a few web servers are available for mining information on a very limited number of PTMs, and they are based on simple pattern matching or pre-defined rules. In our work, in order to help researchers and database curators easily find and retrieve PTM information from available text, we have developed a text mining tool called MPTM, which extracts and organizes valuable knowledge about 11 common PTMs from abstracts in PubMed by using relations extracted from dependency parse trees and a heuristic algorithm. It is the first web server that provides a literature mining service for hydroxylation, myristoylation and GPI-anchoring. The tool is also used to find new publications on PTMs from PubMed and uncovers potential PTM information by large-scale text analysis. MPTM analyzes text sentences to identify protein names, including substrates and protein-interacting enzymes, and automatically associates them with the corresponding UniProtKB protein entry. To facilitate further investigation, it also retrieves PTM-related information, such as human diseases, Gene Ontology terms and organisms, from the input text and related databases. In addition, an online database (MPTMDB) with extracted PTM information and a local MPTM Lite package are provided on the MPTM website. MPTM is freely available online at http://bioinformatics.ustc.edu.cn/mptm/ and the source code is hosted on GitHub: https://github.com/USTC-HILAB/MPTM .

9 citations
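The simple pattern-matching baseline that MPTM improves on (its actual extraction uses dependency parse trees) can be sketched with a surface regular expression. The gene-name pattern and verb stems here are illustrative only.

```python
import re

# Surface pattern: <ENZYME> <ptm-verb> <SUBSTRATE>, e.g. "AKT1 phosphorylates FOXO3".
PATTERN = re.compile(
    r"(?P<enzyme>[A-Z][A-Z0-9]{1,9})\s+"
    r"(?P<ptm>phosphorylat|ubiquitinat|acetylat|methylat|hydroxylat|myristoylat)\w*\s+"
    r"(?P<substrate>[A-Z][A-Z0-9]{1,9})")

def extract_ptm_mentions(sentence):
    """Return (enzyme, PTM stem, substrate) triples found by pattern matching."""
    return [(m.group("enzyme"), m.group("ptm"), m.group("substrate"))
            for m in PATTERN.finditer(sentence)]

hits = extract_ptm_mentions("AKT1 phosphorylates FOXO3 at Ser253.")
```

Such surface patterns miss passives, coordinations, and long-range dependencies ("FOXO3, which is phosphorylated by AKT1"), which is exactly what dependency-parse relations are meant to recover.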


Journal ArticleDOI
TL;DR: This paper presents results for rearrangement problems that involve prefix and suffix versions of reversals and transpositions, considering unsigned and signed permutations, and gives 2-approximation and (2+λ)-approximation algorithms for these problems.
Abstract: Some interesting combinatorial problems have been motivated by genome rearrangements, which are mutations that affect large portions of a genome. When we represent genomes as permutations, the goal is to transform a given permutation into the identity permutation with the minimum number of rearrangements. When they affect segments from the beginning (respectively end) of the permutation, they are called prefix (respectively suffix) rearrangements. This paper presents results for rearrangement problems that involve prefix and suffix versions of reversals and transpositions considering unsigned and signed permutations. We give 2-approximation and (2+λ)-approximation algorithms for these problems, where λ is a constant divided by the number of breakpoints (pairs of consecutive elements that should not be consecutive in the identity permutation) in the input permutation. We also give bounds for the diameters concerning these problems and provide ways of improving the practical results of our algorithms.
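The notion of a breakpoint used in the approximation bound can be made concrete. This is the classical unsigned definition with sentinels 0 and n+1 at the permutation's ends; prefix/suffix variants sometimes treat the boundaries differently.

```python
def breakpoints(perm):
    """Number of breakpoints of a permutation of 1..n: adjacent pairs whose
    values differ by more than 1, with sentinels 0 and n+1 at the ends."""
    n = len(perm)
    extended = [0] + list(perm) + [n + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if abs(a - b) != 1)

# The identity has no breakpoints; a reversed block creates them at its ends.
assert breakpoints([1, 2, 3, 4]) == 0
b = breakpoints([3, 2, 1, 4])   # 2 breakpoints: between 0|3 and between 1|4
```

Since each reversal or transposition can remove only a bounded number of breakpoints, the breakpoint count gives the lower bounds from which the 2- and (2+λ)-approximation guarantees are derived.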

Journal ArticleDOI
TL;DR: Using in silico methods, a number of miRNAs with predicted AhR, CAR, and ESR binding sites that are known oncogenes or tumor suppressors were identified; these results will contribute to further investigation of epigenetic mechanisms of carcinogenesis.
Abstract: MicroRNAs (miRNAs) play important roles in the regulation of gene expression at the post-transcriptional level. Many exogenous compounds, or xenobiotics, may affect microRNA expression. It is a well-established fact that xenobiotics with a planar structure, like TCDD and benzo(a)pyrene (BP), can bind the aryl hydrocarbon receptor (AhR), followed by its nuclear translocation and transcriptional activation of target genes. Another chemically diverse group of xenobiotics, including phenobarbital and DDT, can activate the nuclear receptor CAR and, in some cases, the estrogen receptors ESR1 and ESR2. We hypothesized that such chemicals can affect miRNA expression through the activation of AhR, CAR, and ESRs. To test this hypothesis, we used in silico methods to find potential DRE, PBEM, and ERE binding sites for these receptors, respectively. We predicted AhR, CAR, and ESR binding sites in 224 rat, 201 mouse, and 232 human promoters of miRNA-coding genes. In addition, we identified a number of miRNAs with predicted AhR, CAR, and ESR binding sites that are known oncogenes or tumor suppressors. Our results, obtained in silico, open a new strategy for ongoing experimental studies and will contribute to further investigation of the epigenetic mechanisms of carcinogenesis.
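The in silico motif scan described above can be sketched with simplified consensus patterns. The study's actual binding-site models are likely richer (e.g. position-weight matrices); the GCGTG DRE core and GGTCAnnnTGACC ERE consensus used here are textbook simplifications, and the promoter sequence is made up.

```python
import re

# Simplified consensus motifs (assumptions, not the study's exact models):
MOTIFS = {
    "DRE (AhR)": re.compile(r"GCGTG"),                    # dioxin response element core
    "ERE (ESR1/2)": re.compile(r"GGTCA[ACGT]{3}TGACC"),   # palindromic estrogen response element
}

def scan_promoter(seq):
    """Report every motif hit as (motif name, 0-based position)."""
    hits = []
    for name, pattern in MOTIFS.items():
        for m in pattern.finditer(seq):
            hits.append((name, m.start()))
    return sorted(hits, key=lambda h: h[1])

hits = scan_promoter("TTGCGTGAAGGTCATTATGACCA")
```

A real scan would also check the reverse complement strand and apply a score threshold rather than exact matching.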

Journal ArticleDOI
TL;DR: This work developed MicroTarget to predict a microRNA-gene regulatory network using heterogeneous data sources, especially gene and microRNA expression data, and indicates that using expression data in target prediction is more accurate in terms of specificity and sensitivity.
Abstract: MicroRNAs are known to play an essential role in gene regulation in plants and animals. The standard method for understanding microRNA-gene interactions is randomized controlled perturbation experiments. These experiments are costly and time consuming; therefore, the use of computational methods is essential. Several computational methods have been developed to discover microRNA target genes, but these methods have limitations stemming from the features used for prediction. The commonly used features are complementarity to the seed region of the microRNA, site accessibility, and evolutionary conservation. Unfortunately, not all microRNA target sites are conserved or adhere to exact seed complementarity, and relying on site accessibility does not guarantee that the interaction exists. Moreover, studying regulatory interactions using microRNA and mRNA expression data from the same tissue is necessary to understand the specificity of regulation and function. We developed MicroTarget to predict a microRNA-gene regulatory network using heterogeneous data sources, especially gene and microRNA expression data. First, MicroTarget employs expression data to learn a candidate target set for each microRNA. Then, it uses sequence data to provide evidence of direct interactions. MicroTarget scores and ranks the predicted targets based on a set of features. The predicted targets overlap with many of the experimentally validated ones. Our results indicate that using expression data in target prediction is more accurate in terms of specificity and sensitivity. Available at: https://bioinformatics.cs.vt.edu/~htorkey/microTarget .
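MicroTarget's first stage, learning candidate targets from expression data, amounts to keeping genes whose expression anti-correlates with the microRNA across samples; sequence evidence (e.g. seed complementarity) is then applied to the survivors. A minimal sketch with made-up expression profiles and an assumed correlation threshold:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def candidate_targets(mirna_expr, gene_exprs, threshold=-0.7):
    """Keep genes whose expression anti-correlates with the microRNA
    (repression by the miRNA lowers the mRNA level)."""
    return [gene for gene, expr in gene_exprs.items()
            if pearson(mirna_expr, expr) <= threshold]

mirna = [1.0, 2.0, 3.0, 4.0]
genes = {"geneA": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated
         "geneB": [1.0, 2.0, 3.0, 4.0]}   # follows the miRNA instead
cands = candidate_targets(mirna, genes)
```

The threshold and the choice of Pearson (rather than, say, mutual information) are illustrative assumptions, not MicroTarget's documented settings.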

Journal ArticleDOI
TL;DR: This study revealed that miRNA targets function primarily in cell cycle processes, and that the overrepresentation of targets observed in the next two consecutive phylostrata, opisthokonta and eumetazoa, corresponded to the expansion periods of miRNAs in animal evolution.
Abstract: The evolutionary history and origin of the regulatory function of animal non-coding RNAs are not well understood. The lack of conservation of long non-coding RNAs and the small sizes of microRNAs have been major obstacles in their phylogenetic analysis. In this study, we tried to shed more light on the evolution of ncRNA regulatory networks by changing our phylogenetic strategy to focus on the evolutionary pattern of their protein-coding targets. We used available target databases of miRNAs and lncRNAs to find their protein-coding targets in human. We were able to recognize evolutionary hallmarks of ncRNA targets by phylostratigraphic analysis. We found the conventional 3'-UTR and lesser-known 5'-UTR targets of miRNAs to be enriched at three consecutive phylostrata. First, in the eukaryota phylostratum, corresponding to the emergence of miRNAs, our study revealed that miRNA targets function primarily in cell cycle processes. Moreover, the same overrepresentation of targets observed in the next two consecutive phylostrata, opisthokonta and eumetazoa, corresponded to the expansion periods of miRNAs in animal evolution. Coding-sequence targets of miRNAs showed a delayed rise at the opisthokonta phylostratum, compared to the 3'- and 5'-UTR targets of miRNAs. The lncRNA regulatory network was the latest to evolve, at eumetazoa.

Journal ArticleDOI
TL;DR: A sophisticated supervised learning method is used to design a breast cancer grading predictor that fuses heterogeneous data for classification of breast cancer histopathology; it outperforms other state-of-the-art methods and has abundant biological interpretation in explaining differences between breast cancer grades.
Abstract: Breast cancer histologic grade represents the morphological assessment of the tumor's malignancy and aggressiveness, which is vital for clinically planning treatment and estimating prognosis. Therefore, prediction of breast cancer grade can markedly improve the detection of early breast cancer and efficiently guide its treatment. With the advent of high-throughput profiling technology, large amounts of data of different types are rapidly generated, and each data type provides its own unique biological insight. Although much research has focused on cancer grade prediction, little of it has attempted to integrate multiple data types, which can not only improve the results obtained from the learning method but also provide a better understanding and explanation of the biological issues. In this paper, we take advantage of a sophisticated supervised learning method called multiple kernel learning (MKL) to design a breast cancer grading predictor that fuses heterogeneous data for classification of breast cancer histopathology. Furthermore, we extend our model by incorporating biological pathway information. The extended model can evaluate the significance of the various pathways into which differentially expressed genes fall between breast cancer grades. The merits of the novel model are that it bridges omics data and the various phenotypes of breast cancer grades, and that it provides an auxiliary method for integrating omics data in cancer mechanism research. In experiments, the proposed method outperforms other state-of-the-art methods and offers abundant biological interpretation in explaining differences between breast cancer grades.
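The heart of the MKL approach described above is a convex combination of per-data-type kernels. Real MKL learns the combination weights jointly with the classifier; the sketch below fixes them by hand, and the kernels and weights are illustrative assumptions.

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def combine_kernels(kernels, weights):
    """Core MKL object: K = sum_m w_m * K_m with w_m >= 0 and sum w_m = 1.
    In real MKL the weights are optimized jointly with the SVM."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    return lambda x, y: sum(w * k(x, y) for w, k in zip(weights, kernels))

# One kernel per data type (e.g. expression profile vs. pathway features).
K = combine_kernels([linear_kernel, rbf_kernel], [0.7, 0.3])
value = K([1.0, 0.0], [1.0, 0.0])
```

The learned weights w_m then double as an interpretability signal: a large weight marks the data type (or pathway-restricted kernel) that contributes most to separating the grades.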

Journal ArticleDOI
TL;DR: This work describes efficient algorithms that adapt the strict consensus approach to also handle unrestricted supertree problems, and demonstrates their performance in a comparative study with classic supertree heuristics using simulated and empirical data sets.
Abstract: Supertree problems are a standard tool for synthesizing large-scale species trees from a given collection of gene trees under some problem-specific objective. Unfortunately, these problems are typically NP-hard, and often remain so when their instances are restricted to rooted gene trees sampled from the same species. While a class of restricted supertree problems has been effectively addressed by the parameterized strict consensus approach, in practice most gene trees are unrooted and sampled from different species. Here, we overcome this stringent limitation by describing efficient algorithms that adapt the strict consensus approach to also handle unrestricted supertree problems. Finally, we demonstrate the performance of our algorithms in a comparative study with classic supertree heuristics using simulated and empirical data sets.
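The strict consensus object underlying these algorithms can be illustrated on tiny rooted trees: only clusters (leaf sets of internal nodes) present in every input tree survive. A stdlib sketch with trees as nested tuples (the toy trees are hypothetical; the paper's algorithms of course operate on much larger structures):

```python
def clusters(tree):
    """All clusters (leaf sets of internal nodes) of a rooted tree given as
    nested tuples, e.g. (("a", "b"), ("c", "d"))."""
    out = set()
    def walk(node):
        if isinstance(node, str):          # a leaf
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        out.add(leaves)
        return leaves
    walk(tree)
    return out

def strict_consensus_clusters(trees):
    """Clusters shared by every input tree: the strict consensus."""
    shared = clusters(trees[0])
    for t in trees[1:]:
        shared &= clusters(t)
    return shared

t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "b"), "c", "d")
common = strict_consensus_clusters([t1, t2])   # {a,b} survives, {c,d} does not
```

The parameterized approach mentioned in the abstract exploits exactly this structure: clusters common to all inputs are fixed, and the search is confined to the remaining unresolved parts.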

Journal ArticleDOI
TL;DR: Compared to existing methods, WDNfinder can significantly narrow down the set of minimum driver node sets (MDSs) under the restriction of domain knowledge, and shows high accuracy in essential node prediction in these networks.
Abstract: Structural controllability is the generalization of traditional controllability to dynamical systems. During the last decade, interesting biological discoveries have been made by applying structural controllability analysis to biological networks. However, false positive/negative information (i.e. nodes and edges) widely exists in the biological networks documented in public data sources, which can hinder accurate analysis of structural controllability. In this study, we propose WDNfinder, a comprehensive analysis package that provides structural controllability analysis with consideration of node connection strength in biological networks. When applied to the human cancer signaling network and the p53-mediated DNA damage response network, WDNfinder shows high accuracy in essential node prediction. Compared to existing methods, WDNfinder can significantly narrow down the set of minimum driver node sets (MDSs) under the restriction of domain knowledge. Using the p53-mediated DNA damage response network as an illustration, we find more meaningful MDSs with WDNfinder. The source code is implemented in Python and publicly available together with relevant data on GitHub: https://github.com/dustincys/WDNfinder .
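The unweighted core of structural controllability analysis is the classic result of Liu et al.: the minimum number of driver nodes equals max(N - |maximum matching|, 1), where the matching is taken in the bipartite graph formed by an "out" copy and an "in" copy of every node. WDNfinder layers connection strength and domain knowledge on top of this; the sketch below computes only the unweighted MDS size, on a hypothetical 4-node network.

```python
def max_matching(adj, n):
    """Kuhn's augmenting-path maximum matching on the bipartite graph of
    out-/in-copies; adj[u] lists the successors of node u in the network."""
    match_in = {}                        # in-copy -> matched out-copy

    def try_augment(u, seen):
        for v in adj.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            if v not in match_in or try_augment(match_in[v], seen):
                match_in[v] = u
                return True
        return False

    return sum(try_augment(u, set()) for u in range(n))

def min_driver_nodes(adj, n):
    """Liu et al.: the minimum driver node set has size max(n - |matching|, 1)."""
    return max(n - max_matching(adj, n), 1)

# Hypothetical cascade with a branch: 0 -> 1, 1 -> 2, 1 -> 3.
adj = {0: [1], 1: [2, 3]}
drivers = min_driver_nodes(adj, 4)   # path 0->1->2 is matched; node 3 needs its own driver
```

Enumerating which nodes appear in some (or every) MDS, rather than just the set's size, is the harder problem that tools like WDNfinder address.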

Journal ArticleDOI
TL;DR: A first principles model, developed here for the E. coli QS system, builds on known mechanistic detail and is used to develop a working model of LuxS-regulated (Lsr) activity, meant to discriminate among hypothetical mechanisms governing lsr transcriptional regulation.
Abstract: Quorum sensing (QS) enables bacterial communication and collective behavior in response to self-secreted signaling molecules. Unlocking its genetic regulation will provide insight into its influence on pathogenesis, biofilm formation, and many other phenotypes. There are few datasets available that link QS-mediated gene expression to its regulatory components, and even fewer mathematical models that incorporate known mechanistic detail. By integrating these data with annotated sequence information, mathematical inferences can be pieced together that shed light on regulatory structure. A first-principles model, developed here for the E. coli QS system, builds on known mechanistic detail and is used to develop a working model of LuxS-regulated (Lsr) activity. That is, our model is meant to discriminate among hypothetical mechanisms governing lsr transcriptional regulation. Our simulations are in qualitative agreement with experimentally observed data. Importantly, our results point to the importance of cycling of the transcriptional regulator LsrR in genetic control. We also found several experimental observations in E. coli and homologous systems that are not explained by the current mechanistic understanding. For example, by comparing simulations with reports on the integration host factor in Aggregatibacter actinomycetemcomitans, we conclude that additional transcriptional components are likely involved. An iterative process of simulation and experiment is therefore needed to inform new experiments and incorporate new model detail, which will more rapidly validate mechanistic understanding.

Journal ArticleDOI
TL;DR: This paper introduces a series of new features with 80 dimensions called short sequence motifs (SSM) and uses a neural network algorithm called the voting-based extreme learning machine (V-ELM) to identify real piRNAs, showing that this method is more effective than piRPred, piRNApredictor, Asym-Pibomd, Piano and McRUMs.
Abstract: Piwi-interacting RNAs (piRNAs) were recently discovered as endogenous small noncoding RNAs. Some recent research suggests that piRNAs may play an important role in cancer, so the precise identification of human piRNAs is an important task. In this paper, we introduce a series of new features with 80 dimensions called short sequence motifs (SSM). A hybrid feature vector with 1444 dimensions can be formed by combining 1364 features of k-mer strings with the 80 SSM features. We optimize the 1444-dimensional features using the feature score criterion (FSC) and list them in descending order of score. The first 462 are selected as the input feature vector for the classifier. Moreover, eight of the 80 SSM features appear in the top 20, which indicates that these eight SSM features play an important part in the identification of piRNAs. Five of these eight SSM features are associated with the nucleotides A and G ('A*G', 'A**G', 'A***G', 'A****G', 'A*****G'), which suggests possible biological significance. We also use a neural network algorithm called the voting-based extreme learning machine (V-ELM) to identify real piRNAs. The specificity (Sp) and sensitivity (Sn) of our method are 95.48% and 94.61%, respectively, for the human species. This result shows that our method is more effective than piRPred, piRNApredictor, Asym-Pibomd, Piano and McRUMs. The web service of V-ELMpiRNAPred is available for free at http://mm20132014.wicp.net:38601/velmprepiRNA/Main.jsp .
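The two feature families can be sketched directly: the 1364 k-mer frequencies (4 + 16 + 64 + 256 + 1024 for k = 1..5) and the 80 SSM counts (16 base pairs times gap lengths 1..5, matching the 'A*G' ... 'A*****G' notation). The normalization details below are assumptions; the paper's exact scaling may differ.

```python
from itertools import product

BASES = "ACGU"

def kmer_features(seq, kmax=5):
    """Frequencies of all 1- to kmax-mers: 4 + 16 + ... + 4**5 = 1364 dims."""
    feats = []
    for k in range(1, kmax + 1):
        total = max(len(seq) - k + 1, 1)
        counts = {}
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] = counts.get(seq[i:i + k], 0) + 1
        feats.extend(counts.get("".join(kmer), 0) / total
                     for kmer in product(BASES, repeat=k))
    return feats

def ssm_features(seq, gaps=range(1, 6)):
    """Short-sequence-motif counts: one feature per (first base, gap length,
    last base), e.g. 'A*G' = A, one arbitrary base, G. 16 pairs x 5 gaps = 80."""
    feats = []
    for g in gaps:
        for x in BASES:
            for y in BASES:
                feats.append(sum(1 for i in range(len(seq) - g - 1)
                                 if seq[i] == x and seq[i + g + 1] == y))
    return feats

vec = kmer_features("ACGUACGUACGU") + ssm_features("ACGUACGUACGU")  # 1444 dims
```

Feature selection (the FSC ranking in the paper) would then score each of these 1444 dimensions and keep the top 462.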

Journal ArticleDOI
TL;DR: This paper presents a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance, and shows that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure over a traditional HPC cluster.
Abstract: The size of high-throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications have started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay-as-you-go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model, and understanding the hardware environment that these applications require for good performance, both require further research. In this paper, we present a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability, with competitive assembly quality, compared to contemporary parallel assemblers (e.g. ABySS and Contrail) on a traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure instead of a traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of a traditional HPC cluster.
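The serial core of a de Bruijn graph assembler like GiGA (which distributes this work over Giraph/Hadoop) is building the graph from k-mers and spelling a contig from an Eulerian walk. A toy stdlib sketch, with made-up error-free reads:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """k-mer de Bruijn graph: nodes are (k-1)-mers, one edge per k-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_walk(graph, start):
    """Hierholzer's algorithm; traversing every edge once spells the assembly."""
    g = {u: list(vs) for u, vs in graph.items()}
    stack, walk = [start], []
    while stack:
        u = stack[-1]
        if g.get(u):
            stack.append(g[u].pop())
        else:
            walk.append(stack.pop())
    return walk[::-1]

reads = ["ACGTC", "TCGA"]            # overlapping toy reads
graph = de_bruijn(reads, k=3)
path = eulerian_walk(graph, "AC")
contig = path[0] + "".join(node[-1] for node in path[1:])
```

Real assemblers must additionally handle sequencing errors, repeats, and reverse complements; the distributed version partitions the graph across workers, which is where the collocation of computation and data pays off.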

Journal ArticleDOI
TL;DR: This paper addresses the challenge of predicting the tertiary structure of a given amino acid sequence, a problem reported to belong to the NP-Complete class, and presents a new method, namely NEAT-FLEX, based on NeuroEvolution of Augmenting Topologies (NEAT), to extract structural features from proteins whose structures have been determined experimentally.
Abstract: The development of computational methods to accurately model three-dimensional protein structures from sequences of amino acid residues is becoming increasingly important to the structural biology field. This paper addresses the challenge of predicting the tertiary structure of a given amino acid sequence, which has been reported to belong to the NP-Complete class of problems. We present a new method, namely NEAT–FLEX, based on NeuroEvolution of Augmenting Topologies (NEAT) to extract structural features from (ABS) proteins that are determined experimentally. The proposed method manipulates structural information from the Protein Data Bank (PDB) and predicts the conformational flexibility (FLEX) of residues of a target amino acid sequence. This information may be used in three-dimensional structure prediction approaches as a way to reduce the conformational search space. The proposed method was tested with 24 different amino acid sequences. Evolving neural networks were compared against a traditional error back-propagation algorithm; results show that the proposed method is a powerful way to extract and represent structural information from protein molecules that are determined experimentally.

Journal ArticleDOI
TL;DR: This work compared the contribution of each of the eight search engines integrated in the open-source graphical user interface SearchGUI to the total result of proteoform identification, optimized the set of engines working simultaneously, and selected the combination of X!Tandem, MS-GF+ and OMSSA as the most time-efficient and productive search combination.
Abstract: Proteomic challenges, stirred up by the advent of high-throughput technologies, produce large amounts of MS data. The routine manual search no longer satisfies the pace of modern science. In our work, the necessity of single-thread analysis of bulky data emerged during interpretation of HepG2 proteome profiling results in a search for proteoforms. We compared the contribution of each of the eight search engines (X!Tandem, MS-GF+, MS Amanda, MyriMatch, Comet, Tide, Andromeda, and OMSSA) integrated in the open-source graphical user interface SearchGUI ( http://searchgui.googlecode.com ) to the total result of proteoform identification, and optimized the set of engines working simultaneously. We also compared the results of our search combination with Mascot results using the protein kit UPS2, containing 48 human proteins. We selected the combination of X!Tandem, MS-GF+ and OMSSA as the most time-efficient and productive search combination. We added a homemade Java script to automate the pipeline from file picking to report generation. These settings raised the efficiency of our customized pipeline beyond what is obtainable by manual scouting: the analysis of 192 files searched against the human proteome (42,153 entries downloaded from UniProt) took 11 h.

Journal ArticleDOI
TL;DR: In this study, a support vector machine (SVM)-based method was proposed through integrating PSI-BLAST profile, physicochemical properties, amino acid compositions, and pseudo AACs into the principal feature vector, and a recursive feature selection scheme was subsequently implemented to single out the most discriminative features.
Abstract: Palmitoylation is the covalent attachment of lipids to amino acid residues in proteins. As an important form of protein posttranslational modification, it increases the hydrophobicity of proteins, which contributes to protein transportation, organelle localization, and function, and it therefore plays an important role in a variety of cell biological processes. Identification of palmitoylation sites is necessary for understanding protein-protein interactions, protein stability, and activity. Since conventional experimental techniques to determine palmitoylation sites in proteins are both labor-intensive and costly, a fast and accurate computational approach to predict palmitoylation sites from protein sequences is urgently needed. In this study, a support vector machine (SVM)-based method was proposed that integrates the PSI-BLAST profile, physicochemical properties, k-mer amino acid compositions (AACs), and k-mer pseudo AACs into the principal feature vector. A recursive feature selection scheme was subsequently implemented to single out the most discriminative features. Finally, an SVM was applied to predict palmitoylation sites in proteins based on the optimal features. The proposed method achieved an accuracy of 99.41% and a Matthews correlation coefficient of 0.9773 on a benchmark dataset. This result indicates the efficiency and accuracy of our method for predicting palmitoylation sites from protein sequences.

Journal ArticleDOI
TL;DR: A shift from the traditional view of protein ortholog groups as hard clusters to soft clusters is proposed, and the MinDRPGT problem is studied, which consists in finding a protein supertree and a gene tree minimizing a double reconciliation cost, given a species tree and a set of protein subtrees.
Abstract: The architecture of eukaryotic coding genes allows a single gene to produce several different protein isoforms. Current gene phylogeny reconstruction methods make use of a single protein product per gene, ignoring information on alternative protein isoforms. These methods often lead to inaccurate gene tree reconstructions that need to be corrected before phylogenetic analyses. Here, we propose a new approach for the reconstruction of gene trees and protein trees that accounts for alternative protein isoforms. We extend the concept of reconciliation to protein trees, and we define a new reconciliation problem called MinDRGT, which consists in finding a gene tree that minimizes a double reconciliation cost with a given protein tree and a given species tree. We define a second problem, called MinDRPGT, which consists in finding a protein supertree and a gene tree minimizing a double reconciliation cost, given a species tree and a set of protein subtrees. We propose a shift from the traditional view of protein ortholog groups as hard clusters to soft clusters, and we study the MinDRPGT problem under this assumption. We provide exact and heuristic algorithmic solutions for versions of the problems, and we present the results of applications on protein and gene trees from the Ensembl database. The implementations of the methods are available at https://github.com/UdeS-CoBIUS/Protein2GeneTree and https://github.com/UdeS-CoBIUS/SuperProteinTree .
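
The double-reconciliation cost builds on classic LCA reconciliation between a gene tree and a species tree. A toy sketch of the LCA mapping and duplication count on binary trees; the tuple-encoded trees and species names are illustrative, not the paper's data structures or its double cost:

```python
def lca_reconcile(gene_tree, species_of, parent):
    """Classic LCA reconciliation: map each gene-tree node to a species-tree
    node; an internal node is a duplication if it maps to the same species
    node as one of its children. Returns the duplication count."""
    def ancestors(s):
        path = [s]
        while s in parent:
            s = parent[s]
            path.append(s)
        return path

    def lca(a, b):
        pa = ancestors(a)
        return next(s for s in ancestors(b) if s in pa)

    dups = 0
    def walk(node):
        nonlocal dups
        if isinstance(node, str):          # leaf gene -> its species
            return species_of[node]
        left, right = map(walk, node)
        m = lca(left, right)
        if m in (left, right):             # child already maps here
            dups += 1
        return m
    walk(gene_tree)
    return dups

# toy species tree: human and mouse under a common ancestor "anc"
parent = {"human": "anc", "mouse": "anc"}
dups = lca_reconcile((("h1", "h2"), "m1"),
                     {"h1": "human", "h2": "human", "m1": "mouse"},
                     parent)
```

MinDRGT/MinDRPGT layer a second reconciliation (protein tree against gene tree) on top of this kind of mapping and minimize the combined cost.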

Journal ArticleDOI
TL;DR: The developed software was demonstrated for analysis of telomere length in patients with rheumatoid arthritis; the MeTeLen software contains new options that can be used to solve some Q-FISH and microscopy problems, including correction of irregular light effects and elimination of background fluorescence.
Abstract: Telomere length is an important indicator of proliferative cell history and potential. Decreasing telomere length in immune system cells can indicate immune aging in immune-mediated and chronic inflammatory diseases. Quantitative fluorescent in situ hybridization (Q-FISH) of a labeled (C3TA2)3 peptide nucleic acid probe onto fixed metaphase cells, followed by digital image microscopy, allows the evaluation of telomere length in the arms of individual chromosomes. Computer-assisted analysis of microscopic images can provide quantitative information on the number of telomeric repeats in individual telomeres. We developed new software, MeTeLen, to estimate telomere length. The software contains new options that can be used to solve some Q-FISH and microscopy problems, including correction of irregular light effects and elimination of background fluorescence. The identification and description of chromosomes and chromosome regions are essential to the Q-FISH technique. To improve the quality of cytogenetic analysis after Q-FISH, we optimized the temperature and time of DNA denaturation to obtain better DAPI-banding of metaphase chromosomes. MeTeLen was tested by comparing telomere length estimates for sister chromatids, background fluorescence estimates, and correction of nonuniform light effects. The application of the developed software to analysis of telomere length in patients with rheumatoid arthritis was demonstrated.
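
Correction of irregular light effects is a form of flat-field correction. A toy sketch of one common approach, scaling each pixel by the ratio of the global mean intensity to a local-mean estimate of the illumination field; this is a generic method, not necessarily MeTeLen's algorithm:

```python
def flat_field(img, radius=1):
    """Correct uneven illumination in a 2D intensity image (list of rows):
    each pixel is scaled by global_mean / local_mean, flattening smooth
    illumination gradients while preserving local contrast."""
    h, w = len(img), len(img[0])
    gmean = sum(map(sum, img)) / (h * w)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # local mean over a (2*radius+1)^2 window, clipped at borders
            vals = [img[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            local = sum(vals) / len(vals)
            out[y][x] = img[y][x] * gmean / local if local else 0.0
    return out

corrected = flat_field([[5.0] * 4 for _ in range(4)])
```

A uniformly lit image passes through unchanged; in practice the local window would be much larger than the telomere spots so that the correction tracks only the slow illumination gradient.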

Journal ArticleDOI
TL;DR: A fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data by partitioning the large number of sequence reads into groups (called canopies) using hashing and demonstrating the ability of this approach to determine meaningful Operational Taxonomic Units (OTU) is demonstrated.
Abstract: Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Due to the large volume of random short sequences (reads) obtained from community sequencing, analyzing the diversity, abundance and functions of different organisms within these communities is a challenging task. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S a...
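
A canopy key can be any cheap hash that sends similar reads to the same bucket. A toy sketch using the lexicographically smallest k-mer of each read as the key; this is one plausible scheme for illustration, not one of the three hashing schemes the paper actually compares:

```python
def min_kmer(read, k=8):
    """Cheap locality-sensitive key: the lexicographically smallest k-mer.
    Overlapping reads tend to share this minimizer and thus the same bucket."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))

def canopies(reads, k=8):
    """Partition reads into canopies keyed by their minimizer."""
    groups = {}
    for read in reads:
        groups.setdefault(min_kmer(read, k), []).append(read)
    return groups

c = canopies(["ACGTACGTACGT", "CGTACGTACGTA", "TTTTTTTTGGGG"])
```

Each canopy is small enough to hand to an expensive exact clustering method, which is the pre-processing role described above.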

Journal ArticleDOI
TL;DR: A flexible approach is proposed that enables multi-source information reconciliation for candidate gene prioritization, exploiting the complementary nature of various knowledge sources so as to utilize the maximum information of the aggregated data.
Abstract: In complex disorders, the collaborative role of several genes accounts for the multitude of symptoms, and discovering the molecular mechanisms requires a proper understanding of the pertinent genes. The majority of recent techniques either utilize a single information source or consolidate independent outputs from multiple knowledge sources to assist the discovery of candidate genes. However, given that various heterogeneous sources are potentially relevant for candidate gene prioritization, each carrying information not conveyed by the others, we argue that an ideal strategy should weigh them in a genuinely integrative fashion that captures the contribution of each, instead of using a simple mixture of sources. We propose a flexible approach that enables multi-source information reconciliation for candidate gene prioritization, exploiting the complementary nature of the various knowledge sources so as to utilize the maximum information of the aggregated data. To illustrate the proposed approach, we took Autism Spectrum Disorder (ASD) as a case study and validated the framework on benchmark studies. We observed that the combined ranking based on integrated knowledge reduces false positive observations and boosts performance when compared with individual rankings. The clinical phenotype validation for ASD shows that there is a significant linkage between top-positioned genes and endophenotypes of ASD. Categorization of genes based on endophenotype associations by this method will be useful for further hypothesis generation leading to clinical and translational analysis. This approach may also be useful in other complex neurological and psychiatric disorders with a strong genetic component.
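
One simple way to combine per-source rankings is Borda-style mean-rank aggregation. A sketch with hypothetical gene lists; the aggregation rule and the missing-gene penalty are illustrative assumptions, not the paper's integration scheme:

```python
def aggregate_ranks(rankings):
    """Combine per-source gene rankings by mean rank (Borda-style).

    A gene absent from a source is assigned one past that source's last
    rank, a mild penalty for missing evidence."""
    genes = set().union(*rankings)
    def mean_rank(g):
        return sum(r.index(g) if g in r else len(r) for r in rankings) / len(rankings)
    return sorted(genes, key=mean_rank)

# hypothetical ranked lists from three knowledge sources
combined = aggregate_ranks([["SHANK3", "CHD8", "SCN2A"],
                            ["CHD8", "SHANK3"],
                            ["SHANK3", "SCN2A", "CHD8"]])
```

A gene consistently near the top across sources outranks one that a single source places first, which is how integrated ranking suppresses source-specific false positives.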

Journal ArticleDOI
TL;DR: This work introduces an efficient alignment-free approach to estimate the abundances of microbial genomes, represented by genome-specific markers (GSM), in metagenomic samples, based on solving linear and quadratic programs.
Abstract: Determining the abundances of microbial genomes in metagenomic samples is an important problem in analyzing metagenomic data. Although homology-based methods are popular, they have been shown to be computat...
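
Because genome-specific markers are by definition unique to one genome, the simplest abundance estimate divides marker read counts by marker length and normalizes; the linear and quadratic programs mentioned in the TL;DR generalize this. A toy sketch of the unique-marker case with hypothetical counts:

```python
def relative_abundance(marker_hits, marker_len):
    """Estimate relative genome abundance from genome-specific markers:
    per-genome depth = read hits per base of marker sequence,
    then normalize depths to sum to 1."""
    depth = {g: marker_hits[g] / marker_len[g] for g in marker_hits}
    total = sum(depth.values())
    return {g: d / total for g, d in depth.items()}

# hypothetical data: hits to each genome's markers, total marker lengths (bp)
ab = relative_abundance({"genomeA": 300, "genomeB": 100},
                        {"genomeA": 1000, "genomeB": 1000})
```

Normalizing by marker length removes the bias toward genomes that simply have more marker sequence; the optimization-based formulation additionally handles read-assignment noise across markers.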

Journal ArticleDOI
Bo Liao1, Xiangjun Wang1, Wen Zhu1, Xiong Li1, Lijun Cai1, Haowen Chen1 
TL;DR: The experimental results show that this new LD measure can be directly applied to genotype datasets collected from the HapMap project, saving the cost of haplotyping, and that the proposed method improves the efficiency and prediction accuracy of tag SNP selection.
Abstract: Numerous approaches have been proposed for selecting an optimal tag single-nucleotide polymorphism (SNP) set. Most of these approaches are based on linkage disequilibrium (LD). Classical LD measures, such as D' and r^2, are frequently used to quantify pairwise (two-marker) linkage disequilibrium. Despite their successful use in many applications, these measures cannot quantify LD among multiple markers, and they require allele frequency information collected from haplotype datasets. In this study, a clustering algorithm is proposed that groups SNPs according to a multilocus LD measure based on information theory. Tag SNPs are then selected within each cluster, optimizing criteria such as the number of tag SNPs and prediction accuracy. The experimental results show that this new LD measure can be directly applied to genotype datasets collected from the HapMap project, thereby saving the cost of haplotyping. More importantly, the proposed method significantly improves the efficiency and prediction accuracy of tag SNP selection.
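
An information-theoretic LD measure for a pair of loci is the mutual information between their genotype distributions, which, unlike D' and r^2, extends naturally to multiple markers via joint entropies. A pairwise sketch on genotype data coded 0/1/2; the paper's exact multilocus formula is not reproduced here:

```python
from collections import Counter
from math import log2

def entropy(column):
    """Shannon entropy (bits) of an observed genotype column."""
    n = len(column)
    return -sum(c / n * log2(c / n) for c in Counter(column).values())

def mutual_info(snp1, snp2):
    """Information-theoretic pairwise LD: I(X;Y) = H(X) + H(Y) - H(X,Y).
    Zero for independent loci, larger for stronger disequilibrium."""
    joint = list(zip(snp1, snp2))
    return entropy(snp1) + entropy(snp2) - entropy(joint)

# toy data: two perfectly correlated SNPs across six individuals
mi = mutual_info([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])
```

Because it works on genotype counts directly, no haplotype phasing is needed, which is the cost saving the abstract points to.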

Journal ArticleDOI
TL;DR: No single method beats the others on all criteria, but the methods introduced by Guimera and Amaral and by Verwoerd perform best among metabolite-based methods, and the method introduced by Sridharan et al. performs best among reaction-based ones.
Abstract: A metabolic network model provides a computational framework for studying the metabolism of a cell at the system level. The organization of metabolic networks has been investigated in different studies. One organizational aspect considered in these studies is the decomposition of a metabolic network. The decompositions produced by different methods are very different, and there is no comprehensive evaluation framework to compare the results with each other. In this study, these methods are first reviewed and compared. They are then applied to six different metabolic network models, and the results are evaluated and compared based on two existing and two newly proposed criteria. Results show that no single method beats the others on all criteria, but the methods introduced by Guimera and Amaral and by Verwoerd perform best among metabolite-based methods, and the method introduced by Sridharan et al. performs best among reaction-based ones. The methods are also applied to several artificial networks, each constructed by merging a few KEGG pathways, and their capability to recover those pathways is compared. Among metabolite-based methods, the method of Guimera and Amaral again performs best; however, no notable difference between the performances of the reaction-based methods was detected.
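
Several of the compared decomposition methods (including Guimera and Amaral's) optimize some variant of Newman modularity. A sketch of computing the modularity Q of a given partition of an undirected network; this is the generic definition, not any single reviewed method:

```python
def modularity(edges, community):
    """Newman modularity Q of a partition: the fraction of edges inside
    communities minus the fraction expected under a random rewiring
    that preserves node degrees."""
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # observed within-community edge fraction
    q = sum(1.0 for u, v in edges if community[u] == community[v]) / m
    # subtract the degree-based expectation per community
    for c in set(community.values()):
        dc = sum(d for n, d in deg.items() if community[n] == c)
        q -= (dc / (2 * m)) ** 2
    return q

# toy network: two triangles joined by a single bridge edge
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
part = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
q = modularity(edges, part)
```

A decomposition scoring high Q has dense intra-module and sparse inter-module connectivity, which is one of the criteria such evaluations rely on.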