
Showing papers in "Journal of Bioinformatics and Computational Biology in 2011"


Journal ArticleDOI
TL;DR: A comparison of data-driven normalization methods for TaqMan low-density qPCR arrays, together with descriptive statistical techniques that can facilitate the choice of normalization method, shows that the data-driven methods reduce variation and represent robust alternatives to using endogenous controls.
Abstract: Low-density arrays for quantitative real-time PCR (qPCR) are increasingly being used as an experimental technique for miRNA expression profiling. As with gene expression profiling using microarrays, data from such experiments need effective analysis methods to produce reliable and high-quality results. In the pre-processing of the data, one crucial analysis step is normalization, which aims to reduce measurement errors and technical variability among arrays that might have arisen during the execution of the experiments. However, there are currently a number of different approaches to choose among, and an unsuitably chosen method may induce misleading effects, which could affect the subsequent analysis steps and thereby any conclusions drawn from the results. The choice of normalization method is hence an important issue to consider. In this study we present a comparison of several data-driven normalization methods for TaqMan low-density qPCR arrays, together with different descriptive statistical techniques that can facilitate the choice of normalization method. The performance of the normalization methods was assessed and compared against each other as well as against standard normalization using endogenous controls. The results clearly show that the data-driven methods reduce variation and represent robust alternatives to using endogenous controls.
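
The abstract does not list the specific data-driven methods compared, but quantile normalization is a standard representative of this family. Below is a minimal numpy sketch of quantile normalization applied to a matrix of Cq values; the function name and toy data are ours, not from the paper:

```python
import numpy as np

def quantile_normalize(cq):
    """Quantile normalization of a (miRNAs x arrays) matrix of Cq values:
    every array (column) is forced onto the same reference distribution,
    the mean of the sorted columns. A generic data-driven method, shown
    for illustration only."""
    order = np.argsort(cq, axis=0)                 # per-array ranking
    mean_dist = np.sort(cq, axis=0).mean(axis=1)   # reference distribution
    out = np.empty_like(cq, dtype=float)
    for j in range(cq.shape[1]):
        out[order[:, j], j] = mean_dist            # k-th smallest gets k-th reference value
    return out

# toy data: 5 miRNAs measured on 3 arrays (values invented)
cq = np.array([[24.1, 25.0, 23.8],
               [30.2, 31.1, 29.9],
               [27.5, 28.3, 27.1],
               [22.0, 23.2, 21.9],
               [33.4, 34.0, 33.1]])
print(quantile_normalize(cq))
```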

57 citations


Journal ArticleDOI
TL;DR: The new program package JAGUC bridges the gap between computational and biological sciences, enabling biologists to process large sequence data sets and to infer biological meaning from hundreds of thousands of raw sequences.
Abstract: Background: The study of microbial diversity and community structures heavily relies on the analyses of sequence data, predominantly taxonomic marker genes like the small subunit of the ribosomal RNA (SSU rRNA) amplified from environmental samples. Until recently, the "gold standard" for this strategy was the cloning and Sanger sequencing of amplified target genes, usually restricted to a few hundred sequences per sample due to relatively high costs and labor intensity. The recent introduction of massive parallel tag sequencing strategies like pyrosequencing (454 sequencing) has opened a new window into microbial biodiversity research. Due to its swift nature and relatively low expense, this strategy produces millions of environmental SSU rDNA sequences, granting the opportunity to gain deep insights into the true diversity and complexity of microbial communities. The bottleneck, however, is the computational processing of these massive sequence data, without which biologists can hardly exploit the full information included in them. Results: The freely available standalone software package JAGUC implements a broad range of functions, allowing for efficient and convenient processing of huge numbers of sequence tags, including importing custom-made reference databases for basic local alignment searches, user-defined quality and search filters for analyses of specific sets of sequences, pairwise alignment-based sequence similarity calculations and clustering, as well as sampling saturation and rank abundance analyses. In initial applications, JAGUC successfully analyzed hundreds of thousands of sequences (eukaryote SSU rRNA genes) from aquatic samples and was also applied to quality assessments of different pyrosequencing platforms. Conclusions: The new program package JAGUC is a tool that bridges the gap between computational and biological sciences. It enables biologists to process large sequence data sets in order to infer biological meaning from hundreds of thousands of raw sequences. JAGUC offers advantages over available tools, which are further discussed in this manuscript.

43 citations


Journal ArticleDOI
TL;DR: CScore, a data-driven scoring function using a modified Cerebellar Model Articulation Controller (CMAC) learning architecture, is presented for accurate binding affinity prediction and it is shown that CScore will perform better if sufficient and relevant data is presented.
Abstract: Protein-ligand docking is a computational method to identify the binding mode of a ligand and a target protein, and predict the corresponding binding affinity using a scoring function. This method has great value in drug design. After decades of development, scoring functions nowadays typically can identify the true binding mode, but the prediction of binding affinity still remains a major problem. Here we present CScore, a data-driven scoring function using a modified Cerebellar Model Articulation Controller (CMAC) learning architecture, for accurate binding affinity prediction. The performance of CScore in terms of correlation between predicted and experimental binding affinities is benchmarked under different validation approaches. CScore achieves a prediction with R = 0.7668 and RMSE = 1.4540 when tested on an independent dataset. To the best of our knowledge, this result outperforms other scoring functions tested on the same dataset. The performance of CScore varies on different clusters under the leave-cluster-out validation approach, but still achieves competitive results. Lastly, the target-specific CScore achieves an even better result, with R = 0.8237 and RMSE = 1.0872, trained on a much smaller but more relevant dataset for each target. The large body of structural information on protein-ligand complexes and advances in machine learning techniques enable the data-driven approach to binding affinity prediction. CScore is capable of accurate binding affinity prediction. It is also shown that CScore will perform better if sufficient and relevant data are presented. As the amount of publicly available structural data grows, further improvement of this scoring scheme can be expected.
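
For readers unfamiliar with the CMAC architecture, the sketch below implements a generic hashed tile-coding CMAC trained by LMS updates on a toy two-descriptor target. It illustrates the learning architecture only; CScore's modifications, descriptors, and training protocol are not reproduced, and all names and numbers here are ours:

```python
import numpy as np

class TinyCMAC:
    """Minimal CMAC function approximator (hashed tile coding)."""
    def __init__(self, n_tilings=8, n_bins=10, n_weights=4096, lr=0.1, seed=0):
        self.n_tilings, self.n_bins, self.lr = n_tilings, n_bins, lr
        self.w = np.zeros(n_weights)
        self.offsets = np.random.default_rng(seed).random((n_tilings, 1))

    def _active_cells(self, x):
        # one active cell per tiling; cells are hashed into a shared weight table
        idx = []
        for t in range(self.n_tilings):
            bins = np.floor((x + self.offsets[t]) * self.n_bins).astype(int)
            idx.append(hash((t, bins.tobytes())) % self.w.size)
        return np.array(idx)

    def predict(self, x):
        return self.w[self._active_cells(x)].sum()

    def update(self, x, y):
        cells = self._active_cells(x)
        err = y - self.w[cells].sum()          # LMS update spread over tilings
        self.w[cells] += self.lr * err / self.n_tilings

# train on a toy target: "affinity" = 2*x0 - x1 over descriptors in [0, 1]
cmac = TinyCMAC()
rng = np.random.default_rng(1)
for _ in range(5000):
    x = rng.random(2)
    cmac.update(x, 2.0 * x[0] - 1.0 * x[1])
print(cmac.predict(np.array([0.5, 0.5])))      # roughly 0.5
```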

40 citations


Journal ArticleDOI
TL;DR: The work in this paper provides an approach to enumerate the top-ranked possible topologies rather than the entire population of topologies, which is particularly practical for large proteins.
Abstract: Electron cryo-microscopy is a fast advancing biophysical technique to derive three-dimensional structures of large protein complexes. Using this technique, many density maps have been generated at intermediate resolution such as 6–10 Å resolution. Although it is challenging to derive the backbone of the protein directly from such density maps, secondary structure elements such as helices and β-sheets can be computationally detected. Our work in this paper provides an approach to enumerate the top-ranked possible topologies instead of enumerating the entire population of the topologies. This approach is particularly practical for large proteins. We developed a directed weighted graph, the topology graph, to represent the secondary structure assignment problem. We prove that the problem of finding the valid topology with the minimum cost is NP hard. We developed an O(N^2 · 2^N) dynamic programming algorithm to identify the topology with the minimum cost. The test of 15 proteins suggests that our dynamic programming approach is feasible to work with proteins of much larger size than we could before. The largest protein in the test contains 18 helical sticks detected from the density map out of 33 helices in the protein.
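
To make the O(N^2 · 2^N) dynamic program concrete, the sketch below solves a stripped-down version of the problem: sequence helices are placed onto detected sticks in order, minimizing node (assignment) plus edge (loop) costs via a bitmask DP over stick subsets. The paper's topology graph also handles stick directions and other validity constraints, which are omitted here; the costs are invented for illustration:

```python
import math

def min_cost_topology(node_cost, edge_cost):
    """Bitmask DP, O(N^2 * 2^N): place sequence helices 0..N-1, in order,
    onto N detected sticks. node_cost[i][s] scores helix i on stick s;
    edge_cost[s][t] scores the loop from stick s to stick t."""
    n = len(node_cost)
    FULL = 1 << n
    best = [[math.inf] * n for _ in range(FULL)]
    for s in range(n):
        best[1 << s][s] = node_cost[0][s]          # helix 0 on stick s
    for mask in range(1, FULL):
        i = bin(mask).count("1")                   # index of next helix to place
        if i >= n:
            continue
        for s in range(n):
            if not (mask & (1 << s)) or best[mask][s] == math.inf:
                continue
            for t in range(n):
                if mask & (1 << t):
                    continue
                cand = best[mask][s] + edge_cost[s][t] + node_cost[i][t]
                if cand < best[mask | (1 << t)][t]:
                    best[mask | (1 << t)][t] = cand
    return min(best[FULL - 1])

# 3 helices / 3 sticks with invented costs; optimal order is sticks 0 -> 1 -> 2
node_cost = [[0, 2, 3], [2, 0, 2], [3, 2, 0]]
edge_cost = [[0, 1, 4], [1, 0, 1], [4, 1, 0]]
print(min_cost_topology(node_cost, edge_cost))     # -> 2
```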

38 citations


Journal ArticleDOI
TL;DR: An overview of recent advances in developing bioinformatic approaches for predicting protease substrate cleavage sites and identifying novel putative substrates is provided, along with some suggestions about how future studies might further improve the accuracy of protease substrate specificity prediction.
Abstract: Proteases have central roles in "life and death" processes due to their important ability to catalytically hydrolyze protein substrates, usually altering the function and/or activity of the target in the process. Knowledge of the substrate specificity of a protease should, in theory, dramatically improve the ability to predict target protein substrates. However, experimental identification and characterization of protease substrates is often difficult and time-consuming. Thus solving the "substrate identification" problem is fundamental to both understanding protease biology and the development of therapeutics that target specific protease-regulated pathways. In this context, bioinformatic prediction of protease substrates may provide useful and experimentally testable information about novel potential cleavage sites in candidate substrates. In this article, we provide an overview of recent advances in developing bioinformatic approaches for predicting protease substrate cleavage sites and identifying novel putative substrates. We discuss the advantages and drawbacks of the current methods and detail how more accurate models can be built by deriving multiple sequence and structural features of substrates. We also provide some suggestions about how future studies might further improve the accuracy of protease substrate specificity prediction.

34 citations


Journal ArticleDOI
TL;DR: This work proposes two strategies to enhance the sampling of conformations near the native state: an enhanced fragment library with greater structural diversity is used to expand the search space in the context of fragment-based assembly, and only a representative subset of the sampled conformations is retained to further guide the search towards the native state.
Abstract: The three-dimensional structure of a protein is a key determinant of its biological function. Given the cost and time required to acquire this structure through experimental means, computational models are necessary to complement wet-lab efforts. Many computational techniques exist for navigating the high-dimensional protein conformational search space, which is explored for low-energy conformations that comprise a protein's native states. This work proposes two strategies to enhance the sampling of conformations near the native state. An enhanced fragment library with greater structural diversity is used to expand the search space in the context of fragment-based assembly. To manage the increased complexity of the search space, only a representative subset of the sampled conformations is retained to further guide the search towards the native state. Our results make the case that these two strategies greatly enhance the sampling of the conformational space near the native state. A detailed comparative analysis shows that our approach performs as well as state-of-the-art ab initio structure prediction protocols.

29 citations


Journal ArticleDOI
TL;DR: Twelve genes have been identified and verified to be directly correlated to pancreatic cancer survival time and can be used for the prediction of future patients' survival.
Abstract: Pancreatic cancer is the fourth leading cause of cancer deaths in the United States, with five-year survival rates less than 5% due to rare detection in early stages. Identification of genes that are directly correlated to pancreatic cancer survival is crucial for pancreatic cancer diagnostics and treatment. However, no existing GWAS or transcriptome studies are available for addressing this problem. We apply lasso penalized Cox regression to a transcriptome study to identify genes that are directly related to pancreatic cancer survival. This method is capable of handling the right-censoring effect of survival times and the ultrahigh dimensionality of genetic data. A cyclic coordinate descent algorithm is employed to rapidly select the most relevant genes and eliminate the irrelevant ones. Twelve genes have been identified and verified to be directly correlated to pancreatic cancer survival time and can be used for the prediction of future patients' survival.
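
The paper's core machinery, cyclic coordinate descent with soft-thresholding, is easiest to see on the plain lasso. The sketch below uses squared-error loss rather than the Cox partial likelihood the paper actually optimizes, so it illustrates the algorithmic pattern only; data, dimensions, and the penalty value are synthetic:

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the lasso with squared-error loss.
    The paper applies the same soft-thresholding update to the Cox
    partial likelihood; columns of X are assumed standardized."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]                  # remove j's contribution
            z = X[:, j] @ r / n
            beta[j] = soft_threshold(z, lam) / (col_sq[j] / n)
            r -= X[:, j] * beta[j]                  # add it back
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))                      # n << p, as in transcriptome data
X /= X.std(axis=0)
beta_true = np.zeros(200); beta_true[[3, 42, 117]] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.normal(0, 0.5, 60)
print(np.nonzero(lasso_cd(X, y, lam=0.15))[0])      # few selected "genes", incl. 3, 42, 117
```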

28 citations


Journal ArticleDOI
TL;DR: An integer linear programming (ILP) based assignment system (IPASS) that has enabled fully automatic protein structure determination for four test proteins and achieves an average precision and recall higher than the next best method.
Abstract: Error tolerant backbone resonance assignment is the cornerstone of the NMR structure determination process. Although a variety of assignment approaches have been developed, none works sufficiently well on noisy fully automatically picked peaks to enable the subsequent automatic structure determination steps. We have designed an integer linear programming (ILP) based assignment system (IPASS) that has enabled fully automatic protein structure determination for four test proteins. IPASS employs probabilistic spin system typing based on chemical shifts and secondary structure predictions. Furthermore, IPASS extracts connectivity information from the inter-residue information and the (automatically picked) 15N-edited NOESY peaks which are then used to fix reliable fragments. When applied to automatically picked peaks for real proteins, IPASS achieves an average precision and recall of 82% and 63%, respectively. In contrast, the next best method, MARS, achieves an average precision and recall of 77% and 36%, respectively. The assignments generated by IPASS are then fed into our protein structure calculation system, FALCON-NMR, to determine the 3D structures without human intervention. The final models have backbone RMSDs of 1.25 Å, 0.88 Å, 1.49 Å, and 0.67 Å to the reference native structures for proteins TM1112, CASKIN, VRAR, and HACS1, respectively. The web server is publicly available at .
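
A minimal illustration of the ILP viewpoint: the sketch below encodes a one-to-one spin-system-to-residue assignment as a binary program in PuLP. IPASS's actual ILP additionally encodes connectivity from NOESY peaks and fixed fragments, which are omitted here; the cost matrix is hypothetical:

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value

# cost[i][j]: hypothetical penalty for placing spin system i at residue j
# (IPASS derives such scores from chemical shifts and secondary-structure
# predictions; connectivity constraints are omitted in this sketch)
cost = [[1.0, 4.0, 3.0],
        [3.0, 1.5, 2.0],
        [4.0, 2.5, 0.5]]
n = len(cost)

prob = LpProblem("resonance_assignment", LpMinimize)
x = LpVariable.dicts("x", [(i, j) for i in range(n) for j in range(n)],
                     cat="Binary")
prob += lpSum(cost[i][j] * x[(i, j)] for i in range(n) for j in range(n))
for i in range(n):                       # each spin system placed exactly once
    prob += lpSum(x[(i, j)] for j in range(n)) == 1
for j in range(n):                       # each residue receives one spin system
    prob += lpSum(x[(i, j)] for i in range(n)) == 1
prob.solve()
print([(i, j) for i in range(n) for j in range(n) if value(x[(i, j)]) > 0.5])
```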

27 citations


Journal ArticleDOI
TL;DR: This work uses SSA to calculate semantic similarity between pairs of proteins by combining pairwise semantic similarities between the GO terms that annotate the involved proteins, and shows that SSA is highly competitive with the other methods.
Abstract: Existing methods for calculating semantic similarities between pairs of Gene Ontology (GO) terms and gene products often rely on external databases like Gene Ontology Annotation (GOA) that annotate gene products using the GO terms. This dependency leads to some limitations in real applications. Here, we present a semantic similarity algorithm (SSA), that relies exclusively on the GO. When calculating the semantic similarity between a pair of input GO terms, SSA takes into account the shortest path between them, the depth of their nearest common ancestor, and a novel similarity score calculated between the definitions of the involved GO terms. In our work, we use SSA to calculate semantic similarities between pairs of proteins by combining pairwise semantic similarities between the GO terms that annotate the involved proteins. The reliability of SSA was evaluated by comparing the resulting semantic similarities between proteins with the functional similarities between proteins derived from expert annotations or sequence similarity. Comparisons with existing state-of-the-art methods showed that SSA is highly competitive with the other methods. SSA provides a reliable measure for semantic similarity independent of external databases of functional-annotation observations.
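
The graph-based ingredients of SSA, the shortest path between two terms and the depth of their nearest common ancestor, can be computed directly on a toy DAG with networkx, as sketched below. SSA's definition-based similarity term and its actual combination formula are not reproduced; the combination shown is illustrative only, and the term names are invented:

```python
import networkx as nx

# toy GO-like DAG, edges parent -> child (hypothetical terms)
G = nx.DiGraph([("root", "A"), ("root", "B"), ("A", "C"),
                ("A", "D"), ("B", "D"), ("D", "E")])

def depth(t):
    return nx.shortest_path_length(G, "root", t)

def ssa_like_similarity(t1, t2, alpha=0.5):
    """Graph part of an SSA-style score: depth of the nearest common
    ancestor and shortest path between the terms. The definition-based
    component of SSA is omitted; this combination rule is ours."""
    common = (nx.ancestors(G, t1) | {t1}) & (nx.ancestors(G, t2) | {t2})
    nca_depth = max(depth(a) for a in common)          # deepest shared ancestor
    sp = nx.shortest_path_length(G.to_undirected(), t1, t2)
    return nca_depth / (nca_depth + alpha * sp + 1.0)

print(ssa_like_similarity("C", "E"))
```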

26 citations


Journal ArticleDOI
TL;DR: This paper develops a computational method to select sets of mutations predicted to delete immunogenic T-cell epitopes, as evaluated by a 9-mer potential, while simultaneously maintaining important residues and residue interactions, as evaluation by one- and two-body potentials.
Abstract: Exogenous enzymes, signaling peptides, and other classes of nonhuman proteins represent a potentially massive but largely untapped pool of biotherapeutic agents. Adapting a foreign protein for therapeutic use poses numerous design challenges. We focus here on one significant problem: modifying the protein to mitigate the immune response mounted against "non-self" proteins, while not adversely affecting the protein's stability or therapeutic activity. In order to propose such variants suitable for experimental evaluation, this paper develops a computational method to select sets of mutations predicted to delete immunogenic T-cell epitopes, as evaluated by a 9-mer potential, while simultaneously maintaining important residues and residue interactions, as evaluated by one- and two-body potentials. While this design problem is NP-hard, we develop an integer programming approach that works very well in practice. We demonstrate the effectiveness of our approach by developing plans for biotherapeutic proteins that, in previous studies, have been partially deimmunized via extensive experimental characterization and modification of limited segments. In contrast, our global optimization technique considers an entire protein and accounts for all residues, residue interactions, and epitopes in proposing candidates worth subjecting to experimental evaluation.

25 citations


Journal ArticleDOI
TL;DR: The proposed method provides a good correlation between the predicted and experimental folding rates; the comparative results demonstrate that this correlation is better than that of most other methods and suggest the important contribution of sequence order information to the determination of protein folding rates.
Abstract: Predicting protein folding rate from amino acid sequence is an important challenge in computational and molecular biology. Over the past few years, many methods have been developed to reflect the correlation between the folding rates and protein structures and sequences. In this paper, we present an effective method, a combined neural network–genetic algorithm approach, to predict protein folding rates only from amino acid sequences, without any explicit structural information. The originality of this paper is that, for the first time, it tackles the effect of sequence order. The proposed method provides a good correlation between the predicted and experimental folding rates. The correlation coefficient is 0.80 and the standard error is 2.65 for 93 proteins, the largest such database of proteins yet studied, when evaluated with the leave-one-out jackknife test. The comparative results demonstrate that this correlation is better than that of most other methods, and suggest the important contribution of sequence order information to the determination of protein folding rates.

Journal ArticleDOI
TL;DR: The Immune Algorithm (IA), a heuristic search method inspired by the biological mechanism of acquired immunity, was applied to search for the S-system parameters and showed higher performance than GA for both simulation and real data analyses.
Abstract: The S-system model is one of the nonlinear differential equation models of gene regulatory networks, and it can describe various dynamics of the relationships among genes. If we successfully infer rigorous S-system model parameters that describe a target gene regulatory network, we can simulate gene expression mathematically. However, the problem of finding an optimal S-system model parameter set is too complex to be solved analytically. Thus, heuristic search methods that offer approximate solutions are needed to reduce the computational time. In previous studies, several heuristic search methods such as Genetic Algorithms (GAs) have been applied to the parameter search of the S-system model. However, they have not achieved sufficient estimation accuracy. One conceivable reason is that they lack effective mechanisms to escape local optima. We applied an Immune Algorithm (IA) to search for the S-system parameters. IA is also a heuristic search method, inspired by the biological mechanism of acquired immunity. Compared to GA, IA is able to search a large solution space, thereby avoiding local optima, and to maintain multiple candidate solutions. These features work well when searching for S-system model parameters. Indeed, our algorithm showed higher performance than GA for both simulation and real data analyses.
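
For reference, the S-system form is dX_i/dt = alpha_i * prod_j X_j^g_ij - beta_i * prod_j X_j^h_ij, and an IA or GA searches for parameters (alpha, beta, g, h) that make simulated trajectories match observed expression. The sketch below only simulates a hypothetical two-gene S-system with scipy, since evaluating the fitness inside any such search requires exactly this step; all parameter values are invented:

```python
import numpy as np
from scipy.integrate import odeint

def s_system(x, t, alpha, beta, G, H):
    """dX_i/dt = alpha_i * prod_j X_j^G_ij - beta_i * prod_j X_j^H_ij"""
    x = np.maximum(x, 1e-8)               # keep powers well-defined
    return (alpha * np.prod(x ** G, axis=1)
            - beta * np.prod(x ** H, axis=1))

# hypothetical 2-gene network (the paper's IA searches for such parameters)
alpha = np.array([3.0, 2.0]); beta = np.array([2.0, 2.0])
G = np.array([[0.0, -0.8],                # gene 2 represses gene 1
              [0.5,  0.0]])               # gene 1 activates gene 2
H = np.array([[0.5, 0.0],
              [0.0, 0.5]])
t = np.linspace(0, 10, 101)
traj = odeint(s_system, [0.5, 0.5], t, args=(alpha, beta, G, H))
print(traj[-1])                           # near-steady-state expression levels
```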

Journal ArticleDOI
TL;DR: This work demonstrates that the CS method can effectively detect subtypes of leukemia, implying improved accuracy of diagnosis of leukemia.
Abstract: With the development of genomic techniques, the demand for new methods that can handle high-throughput genome-wide data effectively is becoming stronger than ever before. Compressed sensing (CS) is an emerging approach in statistics and signal processing. With the CS theory, a signal can be uniquely reconstructed or approximated from its sparse representations, which can therefore better distinguish different types of signals. However, the application of CS approach to genome-wide data analysis has been rarely investigated. We propose a novel CS-based approach for genomic data classification and test its performance in the subtyping of leukemia through gene expression analysis. The detection of subtypes of cancers such as leukemia according to different genetic makeups is significant, which holds promise for the individualization of therapies and improvement of treatments. In our work, four statistical features were employed to select significant genes for the classification. With our selected genes out of 7,129, the proposed CS method achieved a classification accuracy of 97.4% when evaluated with the cross validation and 94.3% when evaluated with another independent data set. The robustness of the method to noise was also tested, giving good performance. Therefore, this work demonstrates that the CS method can effectively detect subtypes of leukemia, implying improved accuracy of diagnosis of leukemia.
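
One common way to use compressed sensing ideas for classification is sparse-representation classification: recover a sparse code of the test expression profile over the training profiles by l1 minimization, then assign the class whose samples reconstruct the profile best. The sketch below implements that generic scheme with scikit-learn's Lasso; it is not necessarily the paper's exact formulation, and the toy data are synthetic:

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(X_train, y_train, x_test, alpha=0.01):
    """Sparse-representation classification: l1-recover a sparse code of
    the test profile over the training profiles, then pick the class
    whose samples reconstruct it with the smallest residual."""
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train.T, x_test)              # columns = training samples
    code = lasso.coef_
    best, best_res = None, np.inf
    for c in np.unique(y_train):
        part = np.where(y_train == c, code, 0.0)
        res = np.linalg.norm(x_test - X_train.T @ part)
        if res < best_res:
            best, best_res = c, res
    return best

rng = np.random.default_rng(0)
A = rng.normal(0, 1, (10, 50)) + 2.0           # "subtype 0" expression profiles
B = rng.normal(0, 1, (10, 50)) - 2.0           # "subtype 1" expression profiles
X, y = np.vstack([A, B]), np.array([0] * 10 + [1] * 10)
print(src_classify(X, y, X[3] + rng.normal(0, 0.1, 50)))   # -> 0
```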

Journal ArticleDOI
TL;DR: In this article, a computational approach was proposed to model large tumor cell populations and spheroids, and general considerations that apply to any fine-grained numerical model of tumors were discussed.
Abstract: The speed and the versatility of today's computers open up new opportunities to simulate complex biological systems. Here we review a computational approach recently proposed by us to model large tumor cell populations and spheroids, and we put forward general considerations that apply to any fine-grained numerical model of tumors. We discuss ways to bypass computational limitations and discuss our incremental approach, where each step is validated by experimental observations on a quantitative basis. We present a few results on the growth of tumor cells in closed and open environments and of tumor spheroids. This study suggests new ways to explore the initial growth phase of solid tumors and to optimize antitumor treatments.

Journal ArticleDOI
TL;DR: This paper presents hidden Markov random field regression with an L1 penalty to uncover the regulatory network structure for different species; it provides a framework for sharing information across species via hidden component graphs and can easily incorporate domain knowledge across species.
Abstract: Many genes and biological processes function in similar ways across different species. Cross-species gene expression analysis, as a powerful tool to characterize the dynamical properties of the cell, has found a number of applications, such as identifying a conserved core set of cell cycle genes. However, to the best of our knowledge, there is limited effort on developing appropriate techniques to capture the causality relations between genes from time-series microarray data across species. In this paper, we present hidden Markov random field regression with an L1 penalty to uncover the regulatory network structure for different species. The algorithm provides a framework for sharing information across species via hidden component graphs and is able to incorporate domain knowledge across species easily. We demonstrate our method on two synthetic datasets and apply it to discover causal graphs from innate immune response data.

Journal ArticleDOI
TL;DR: The results indicate that the proposed method can accurately identify clusters in the simulated dataset, and the functional modules of the backbone network are more biologically relevant than those obtained from the original approach.
Abstract: Relationships among gene expression levels may be associated with the mechanisms of the disease. While identifying a direct association such as a difference in expression levels between case and control groups links genes to disease mechanisms, uncovering an indirect association in the form of a network structure may help reveal the underlying functional module associated with the disease under scrutiny. This paper presents a method to improve the biological relevance in functional module identification from the gene expression microarray data by enhancing the structure of a weighted gene co-expression network using minimum spanning tree. The enhanced network, which is called a backbone network, contains only the essential structural information to represent the gene co-expression network. The entire backbone network is decoupled into a number of coherent sub-networks, and then the functional modules are reconstructed from these sub-networks to ensure minimum redundancy. The method was tested with a simulated gene expression dataset and case-control expression datasets of autism spectrum disorder and colorectal cancer studies. The results indicate that the proposed method can accurately identify clusters in the simulated dataset, and the functional modules of the backbone network are more biologically relevant than those obtained from the original approach.
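
The backbone idea can be sketched in a few lines: build gene-gene distances 1 - |correlation|, take the minimum spanning tree as the backbone, cut weak backbone edges, and read candidate modules off the connected components. The paper's module reconstruction step is more elaborate than this; the cut threshold and data below are invented:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def backbone_modules(expr, cut=0.6):
    """MST-backbone sketch: expr is a (genes x samples) matrix; returns a
    module label per gene. Only the MST idea from the paper is shown."""
    d = 1.0 - np.abs(np.corrcoef(expr))     # co-expression distance
    np.fill_diagonal(d, 0.0)
    mst = minimum_spanning_tree(csr_matrix(d)).toarray()
    mst[mst > cut] = 0.0                    # drop weak backbone edges
    _, labels = connected_components(csr_matrix(mst), directed=False)
    return labels

rng = np.random.default_rng(0)
base1, base2 = rng.normal(size=20), rng.normal(size=20)
expr = np.vstack([base1 + rng.normal(0, .3, (5, 20)),    # module 1
                  base2 + rng.normal(0, .3, (5, 20))])   # module 2
print(backbone_modules(expr))               # two groups of 5 genes each
```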

Journal ArticleDOI
TL;DR: Here, it is demonstrated how to apply a simple software testing technique, called Metamorphic Testing, to alleviate the oracle problem in testing phylogenetic inference programs, and it is found that metamorphic testing can detect failures effectively in faulty phylogenetic inference programs with both types of test inputs.
Abstract: Many phylogenetic inference programs are available to infer evolutionary relationships among taxa using aligned sequences of characters, typically DNA or amino acids. These programs are often used to infer the evolutionary history of species. However, in most cases it is impossible to systematically verify the correctness of the tree returned by these programs, as the correct evolutionary history is generally unknown and unknowable. In addition, it is nearly impossible to verify whether any non-trivial tree is correct in accordance to the specification of the often complicated search and scoring algorithms. This difficulty is known as the oracle problem of software testing: there is no oracle that we can use to verify the correctness of the returned tree. This makes it very challenging to test the correctness of any phylogenetic inference programs. Here, we demonstrate how to apply a simple software testing technique, called Metamorphic Testing, to alleviate the oracle problem in testing phylogenetic inference programs. We have used both real and randomly generated test inputs to evaluate the effectiveness of metamorphic testing, and found that metamorphic testing can detect failures effectively in faulty phylogenetic inference programs with both types of test inputs.
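
The pattern is easy to demonstrate on a toy program under test. Below, a p-distance function stands in for a phylogenetic inference program, and two metamorphic relations (invariance under reversing both sequences, and under a consistent character relabeling) are checked on random inputs; no oracle for the "correct" output is ever needed. These relations are our own simple examples, not the ones used in the paper:

```python
import random

def p_distance(s1, s2):
    """Program under test: proportion of differing sites (a toy stand-in;
    the paper tests full tree-inference programs the same way)."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def metamorphic_tests(trials=1000):
    """Check necessary relations between outputs of related inputs."""
    alphabet = "ACGT"
    for _ in range(trials):
        n = random.randint(10, 50)
        s1 = "".join(random.choice(alphabet) for _ in range(n))
        s2 = "".join(random.choice(alphabet) for _ in range(n))
        d = p_distance(s1, s2)
        # MR1: reversing both sequences must not change the distance
        assert p_distance(s1[::-1], s2[::-1]) == d
        # MR2: a consistent relabeling of characters must not either
        relabel = str.maketrans("ACGT", "GTAC")
        assert p_distance(s1.translate(relabel), s2.translate(relabel)) == d
    print("all metamorphic relations held")

random.seed(0)
metamorphic_tests()
```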

Journal ArticleDOI
TL;DR: Common feature pharmacophore models are generated using three-dimensional structural information of robotnikinin, an inhibitor of the Shh signaling pathway, and its analogs, and a candidate inhibitor is selected as a potential lead to be employed in future Shh inhibitor design.
Abstract: Sonic hedgehog (Shh) plays an important role in the activation of the Shh signaling pathway, which regulates preservation and rebirth of adult tissues. An abnormal activation of this pathway has been identified in hyperplasia and various types of tumorigenesis. Hence the inhibition of this pathway using a Shh inhibitor might be an efficient way to treat a wide range of malignancies. This study was done in order to develop a lead chemical candidate that has an inhibitory function in the Shh signaling pathway. We have generated common feature pharmacophore models using three-dimensional (3D) structural information of robotnikinin, an inhibitor of the Shh signaling pathway, and its analogs. These models have been validated with fit values of robotnikinin and its analogs, and the best model was used as a 3D structural query to screen chemical databases. The hit compounds resulting from the screening were docked into a proposed binding site of Shh, named the pseudo-active site. Molecular dynamics (MD) simulations were performed to investigate detailed binding modes and molecular interactions between the hit compounds and functional residues of the pseudo-active site. The results of the MD simulation analyses revealed that the hit compounds can bind the pseudo-active site with higher affinity than robotnikinin. As a result of this study, a candidate inhibitor (GK 03795) was selected as a potential lead to be employed in future Shh inhibitor design.

Journal ArticleDOI
TL;DR: This paper reviews and compares two recently developed and publicly available software packages, RegStatGel and Pinnacle, for analyzing 2D gel images and concludes that RegStatGel is much better in terms of spot detection and matching.
Abstract: One of the key limitations for proteomic studies using two-dimensional (2D) gel is the lack of automatic, fast, robust, and reliable methods for detecting, matching, and quantifying protein spots. Although there are commercial software packages for 2D gel image analysis, extensive human intervention is still needed for spot detection and matching, which is time-consuming and error-prone. Moreover, the commercial software packages are usually expensive and non-open source. Thus, it is very beneficial for researchers to have free software that is fast, fully automatic, and robust. In this paper, we review and compare two recently developed and publicly available software packages, RegStatGel and Pinnacle, for analyzing 2D gel images. These two software packages share some common features and also have some fundamental difference in the aspects of spot detection and quantification. Based on our experience, RegStatGel is much better in terms of spot detection and matching. It also contains more advanced statistical tools and is more user-friendly. In contrast, Pinnacle is quite sensitive to background noise and relies on external statistical software packages for statistical analysis.

Journal ArticleDOI
TL;DR: A new framework, an optimized implementation of a random forest classifier is proposed, which addresses specific properties of microarray data, takes computational complexity of a decision tree algorithm into consideration, and shows excellent computing performance while preserving predictive accuracy.
Abstract: Random forest is an ensemble classification algorithm. It performs well when most predictive variables are noisy and can be used when the number of variables is much larger than the number of observations. The use of bootstrap samples and restricted subsets of attributes makes it more powerful than simple ensembles of trees. The main advantage of a random forest classifier is its explanatory power: it measures variable importance or impact of each factor on a predicted class label. These characteristics make the algorithm ideal for microarray data. It was shown to build models with high accuracy when tested on high-dimensional microarray datasets. Current implementations of random forest in the machine learning and statistics community, however, limit its usability for mining over large datasets, as they require that the entire dataset remains permanently in memory. We propose a new framework, an optimized implementation of a random forest classifier, which addresses specific properties of microarray data, takes computational complexity of a decision tree algorithm into consideration, and shows excellent computing performance while preserving predictive accuracy. The implementation is based on reducing overlapping computations and eliminating dependency on the size of main memory. The implementation's excellent computational performance makes the algorithm useful for interactive data analyses and data mining.

Journal ArticleDOI
TL;DR: A subquadratic running time algorithm is presented for computing an alignment that optimizes one of the most widely used measures of protein structure similarity, defined as the number of pairs of residues in two proteins that can be superimposed under a predefined distance cutoff.
Abstract: The problem of finding an optimal structural alignment for a pair of superimposed proteins is often amenable to the Smith–Waterman dynamic programming algorithm, which runs in time proportional to ...
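
For a fixed superposition, this measure is classically computed by a Smith–Waterman-style dynamic program; the sketch below shows that quadratic baseline (counting sequential residue pairs within a distance cutoff), which is the computation the paper speeds up. Coordinates and the cutoff are synthetic:

```python
import numpy as np

def max_superimposable_pairs(A, B, cutoff=5.0):
    """Classical O(n*m) DP baseline: largest set of residue pairs that is
    sequential in both proteins, with pairwise distance under `cutoff` in
    the given superposition. The paper's contribution is a subquadratic
    algorithm for this measure; only the quadratic form is shown here."""
    n, m = len(A), len(B)
    close = (np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
             < cutoff).astype(int)
    dp = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(dp[i - 1, j], dp[i, j - 1],
                           dp[i - 1, j - 1] + close[i - 1, j - 1])
    return dp[n, m]

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 3)) * 10
B = A + rng.normal(0, 1.0, size=(30, 3))    # noisy, pre-superimposed copy
print(max_superimposable_pairs(A, B))       # close to 30
```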

Journal ArticleDOI
TL;DR: BTMX predicts the exposure status of transmembrane residues in transmembrane beta barrel proteins and identifies various physico-chemical properties that show statistically significant differences between the beta strands located at oligomeric interfaces and the non-oligomeric strands; such exposure predictions may be useful in 3D structure prediction of TMBs.
Abstract: We present BTMX (Beta barrel TransMembrane eXposure), a computational method to predict the exposure status (i.e. exposed to the bilayer or hidden in the protein structure) of transmembrane residues in transmembrane beta barrel proteins (TMBs). BTMX predicts the exposure status of known TM residues with an accuracy of 84.2% over 2,225 residues and provides a confidence score for all predictions. Predictions made are in concert with the fact that hydrophobic residues tend to be more exposed to the bilayer. The biological relevance of the input parameters is also discussed. The highest prediction accuracy is obtained when a sliding window comprising three residues with similar Cα–Cβ vector orientations is employed. The prediction accuracy of the BTMX method on a separate unseen non-redundant test dataset is 78.1%. By employing out-pointing residues that are exposed to the bilayer, we have identified various physico-chemical properties that show statistically significant differences between the beta strands located at the oligomeric interfaces compared to the non-oligomeric strands. The BTMX web server generates colored, annotated snake-plots as part of the prediction results and is available under the BTMX tab at http://service.bioinformatik.uni-saarland.de/tmx-site/. Exposure status prediction of TMB residues may be useful in 3D structure prediction of TMBs.

Journal ArticleDOI
TL;DR: It is found that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow, which will affect E-value guided annotation decisions in an automated mode.
Abstract: E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g. 0.1), a critical score region with conflicting E-values from the logistic function (below the threshold) and from EVD (above the threshold) does exist. Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1,748 domain models in the hmmsearch global-search mode. For 145 out of the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value below 0.1 but an EVD-derived E-value above 0.1. We provide examples of false annotations and critically discuss the appropriateness of a logistic function as an alternative to the EVD.

Journal ArticleDOI
TL;DR: A new graph-based semi-supervised classification algorithm Sequential Linear Neighborhood Propagation (SLNP) is introduced, which addresses the problem of the classification of partially labeled protein interaction networks.
Abstract: Predicting protein function is one of the most challenging problems of the post-genomic era. The development of experimental methods for genome-scale analysis of molecular interaction networks has provided new approaches to inferring protein function. In this paper we introduce a new graph-based semi-supervised classification algorithm, Sequential Linear Neighborhood Propagation (SLNP), which addresses the problem of the classification of partially labeled protein interaction networks. The proposed SLNP first constructs a sequence of node sets according to their shortest distance to the labeled nodes, and then predicts the function of the unlabeled proteins using Linear Neighborhood Propagation, proceeding from the sets closest to the labeled nodes. Its performance is assessed on Saccharomyces cerevisiae PPI network data sets, with good results compared with three current state-of-the-art algorithms, especially in settings where only a small fraction of the proteins are labeled.
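
The propagation core that SLNP builds on can be sketched generically: iterate F <- alpha * S F + (1 - alpha) * Y over a row-normalized adjacency matrix until labels spread from the labeled nodes. SLNP's sequential ordering by shortest distance and its linear-neighborhood weight construction are omitted here; the toy graph is ours:

```python
import numpy as np

def propagate_labels(W, Y, alpha=0.8, n_iter=100):
    """Generic graph label propagation on adjacency W with one-hot seed
    labels Y; returns a predicted class per node."""
    S = W / W.sum(axis=1, keepdims=True)       # row-normalized adjacency
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y    # spread labels, re-inject seeds
    return F.argmax(axis=1)

# toy "PPI" graph: two triangles joined by one edge; one labeled node each
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
Y = np.zeros((6, 2)); Y[0, 0] = 1; Y[5, 1] = 1
print(propagate_labels(W, Y))                  # first triangle -> 0, second -> 1
```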

Journal ArticleDOI
TL;DR: A Support Vector Machine (SVM)-based method for protein structural class prediction that uses features derived from the predicted secondary structure and predicted burial information of amino acid residues is presented, and it is revealed that the combination of secondary structural content and secondary structural and solvent accessibility state frequencies of amino acids gave rise to the best leave-one-out cross-validation accuracy.
Abstract: The knowledge collated from the known protein structures has revealed that proteins are usually folded into the four structural classes: all-α, all-β, α/β and α + β. A number of methods have been proposed to predict a protein's structural class from its primary structure; however, it has been observed that these methods fail or perform poorly in the cases of distantly related sequences. In this paper, we propose a new method for protein structural class prediction using a low-homology (twilight-zone) protein sequence dataset. Since protein structural class prediction is a typical classification problem, we have developed a Support Vector Machine (SVM)-based method that uses features derived from the predicted secondary structure and predicted burial information of amino acid residues. The examination of different individual features as well as feature combinations revealed that the combination of secondary structural content and secondary structural and solvent accessibility state frequencies of amino acids gave rise to the best leave-one-out cross-validation accuracy of ~81%, which is comparable to the best accuracy reported in the literature so far.

Journal ArticleDOI
TL;DR: A flexible methodology for the in silico prediction of genes associated with diseases combining the use of available tools for gene enrichment analysis, gene network generation and gene prioritization is described.
Abstract: Experimental techniques for the identification of genes associated with diseases are expensive and have certain limitations. In this scenario, computational methods are useful tools to identify lists of promising genes for further experimental verification. This paper describes a flexible methodology for the in silico prediction of genes associated with diseases combining the use of available tools for gene enrichment analysis, gene network generation and gene prioritization. A set of reference genes, with a known association to a disease, is used as bait to extract candidate genes from molecular interaction networks and enriched pathways. In a second step, prioritization methods are applied to evaluate the similarities between previously selected candidates and the set of reference genes. The top genes obtained by these programs are grouped into a single list sorted by the number of methods that have selected each gene. As a proof of concept, top genes reported a few years ago in SzGene and AlzGene databases were used as references to predict genes associated to schizophrenia and Alzheimer's disease, respectively. In both cases, we were able to predict a statistically significant amount of genes belonging to the updated lists.

Journal ArticleDOI
TL;DR: A new approach is presented to deal with lateral gene transfers that combines the Neighbor-Net algorithm for computing phylogenetic networks with the Minimum Contradiction method, and is illustrated by applying it to a distance matrix for Archaea, Bacteria, and Eukaryota.
Abstract: Identifying lateral gene transfers is an important problem in evolutionary biology. Under a simple model of evolution, the expected values of an evolutionary distance matrix describing a phylogenetic tree fulfill the so-called Kalmanson inequalities. The Minimum Contradiction method for identifying lateral gene transfers exploits the fact that lateral transfers may generate large deviations from the Kalmanson inequalities. Here a new approach is presented to deal with such cases that combines the Neighbor-Net algorithm for computing phylogenetic networks with the Minimum Contradiction method. A subset of taxa, prescribed using Neighbor-Net, is obtained by measuring how closely the Kalmanson inequalities are fulfilled by each taxon. A criterion is then used to identify the taxa, possibly involved in a lateral transfer between nonconsecutive taxa. We illustrate the utility of the new approach by applying it to a distance matrix for Archaea, Bacteria, and Eukaryota.
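
For a distance matrix whose taxa are arranged in a compatible circular order, the Kalmanson inequalities say that for any i < j < k < l the "crossing" sum D(i,k) + D(j,l) is the largest of the three pairings. A small checker like the one below measures how strongly each quadruple violates this, which is the kind of deviation the method uses to flag candidate lateral transfers; the example matrix is additive tree data we constructed:

```python
from itertools import combinations
import numpy as np

def kalmanson_violations(D):
    """For taxa in their given circular order, check that for i<j<k<l
    D[i,k]+D[j,l] >= max(D[i,j]+D[k,l], D[i,l]+D[j,k]); return the
    quadruples that violate this and by how much."""
    viol = []
    for i, j, k, l in combinations(range(len(D)), 4):
        cross = D[i, k] + D[j, l]
        excess = max(D[i, j] + D[k, l], D[i, l] + D[j, k]) - cross
        if excess > 1e-9:
            viol.append(((i, j, k, l), excess))
    return viol

# additive tree distances for 4 taxa in a compatible circular order
D = np.array([[0, 3, 5, 6],
              [3, 0, 6, 7],
              [5, 6, 0, 5],
              [6, 7, 5, 0]], dtype=float)
print(kalmanson_violations(D))   # empty list: no contradiction
```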

Journal ArticleDOI
TL;DR: A method is proposed to automate the basic steps in designing an SVM that improves the accuracy of such classification; it promises to be generally useful in automating the analysis of biological sequences.
Abstract: Hypersensitive (HS) sites in genomic sequences are reliable markers of DNA regulatory regions that control gene expression. Annotation of regulatory regions is important in understanding phenotypical differences among cells and diseases linked to pathologies in protein expression. Several computational techniques are devoted to mapping out regulatory regions in DNA by initially identifying HS sequences. Statistical learning techniques like Support Vector Machines (SVM), for instance, are employed to classify DNA sequences as HS or non-HS. This paper proposes a method to automate the basic steps in designing an SVM that improves the accuracy of such classification. The method proceeds in two stages and makes use of evolutionary algorithms. An evolutionary algorithm first designs optimal sequence motifs to associate explicit discriminating feature vectors with input DNA sequences. A second evolutionary algorithm then designs SVM kernel functions and parameters that optimally separate the HS and non-HS classes. Results show that this two-stage method significantly improves SVM classification accuracy. The method promises to be generally useful in automating the analysis of biological sequences, and we post its source code on our website.

Journal ArticleDOI
TL;DR: A mechanical model of cell motion was developed that reproduced the behaviour of cells in 2-dimensional culture and showed that cells were best modelled with a degree of stickiness just under the critical threshold level, allowing fluidlike motion while maintaining cohesiveness across the population.
Abstract: A mechanical model of cell motion was developed that reproduced the behaviour of cells in 2-dimensional culture. Cell adhesion was modelled with inter-cellular cross-links that attached for different times giving a range of adhesion strength. Simulations revealed an adhesion threshold below which cell motion was almost unaffected and above which cells moved as if permanently linked. Comparing simulated cell clusters (with known connections) to calculated clusters (based only on distance) showed that the calculated clusters did not correspond well across the full size range from small to big clusters. The radial distribution function of the cells was found to be a better measure, giving a good correlation with the known cell linkage throughout the simulation run. This analysis showed that cells were best modelled with a degree of stickiness just under the critical threshold level. This allowed fluidlike motion while maintaining cohesiveness across the population.
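
The radial distribution function used for this comparison is standard: the density of cell-cell pairs at separation r, normalized by the density expected for uniformly scattered points. A minimal 2D version, ignoring edge corrections and using synthetic points, might look like:

```python
import numpy as np

def rdf_2d(points, box_size, r_max, n_bins=50):
    """g(r) for a 2D point set in a box_size x box_size area. Edge
    corrections are ignored, so g(r) is slightly depressed near r_max."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    d = d[np.triu_indices(n, k=1)]                 # unique pairs
    hist, edges = np.histogram(d, bins=n_bins, range=(0, r_max))
    r = 0.5 * (edges[1:] + edges[:-1])
    dr = edges[1] - edges[0]
    rho = n / box_size ** 2                        # number density
    ideal = rho * 2 * np.pi * r * dr * n / 2       # expected pair counts
    return r, hist / ideal

rng = np.random.default_rng(0)
pts = rng.random((400, 2)) * 100                   # uniformly scattered "cells"
r, g = rdf_2d(pts, box_size=100, r_max=10)
print(np.round(g[20:25], 2))                       # ~1 for uniform points
```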

Journal ArticleDOI
TL;DR: A heuristic algorithm, the core of which is an integer linear program (ILP) using the system of linear equations over Galois field GF(2), which can detect and locate genotyping errors that cannot be detected by simply checking the Mendelian law of inheritance.
Abstract: Inferring the haplotypes of the members of a pedigree from their genotypes has been extensively studied. However, most studies do not consider genotyping errors and de novo mutations. In this paper, we study how to infer haplotypes from genotype data that may contain genotyping errors, de novo mutations, and missing alleles. We assume that there are no recombinants in the genotype data, which is usually true for tightly linked markers. We introduce a combinatorial optimization problem, called haplotype configuration with mutations and errors (HCME), which calls for haplotype configurations consistent with the given genotypes that incur no recombinants and require the minimum number of mutations and errors. HCME is NP-hard. To solve the problem, we propose a heuristic algorithm, the core of which is an integer linear program (ILP) using the system of linear equations over Galois field GF(2). Our algorithm can detect and locate genotyping errors that cannot be detected by simply checking the Mendelian law of inheritance. The algorithm also offers error correction in genotypes/haplotypes rather than just detecting inconsistencies and deleting the involved loci. Our experimental results show that the algorithm can infer haplotypes with a very high accuracy and recover 65%–94% of genotyping errors depending on the pedigree topology.
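
The GF(2) component is the most self-contained part: over GF(2), addition is XOR, so consistency of the linear system can be decided by Gaussian elimination, and an inconsistent system signals errors that pure Mendelian checks miss. A minimal solver, with a made-up example system, is sketched below; the paper embeds such systems inside a larger ILP:

```python
import numpy as np

def solve_gf2(A, b):
    """Gaussian elimination over GF(2): returns one solution of Ax = b
    (mod 2), or None if the system is inconsistent."""
    A = np.array(A, dtype=np.uint8) % 2
    b = np.array(b, dtype=np.uint8) % 2
    m, n = A.shape
    row, pivots = 0, []
    for col in range(n):
        piv = next((r for r in range(row, m) if A[r, col]), None)
        if piv is None:
            continue
        A[[row, piv]] = A[[piv, row]]; b[[row, piv]] = b[[piv, row]]
        for r in range(m):                 # eliminate col everywhere else
            if r != row and A[r, col]:
                A[r] ^= A[row]; b[r] ^= b[row]
        pivots.append(col); row += 1
    if any(b[r] and not A[r].any() for r in range(row, m)):
        return None                        # 0 = 1 row: inconsistent system
    x = np.zeros(n, dtype=np.uint8)
    for r, col in enumerate(pivots):
        x[col] = b[r]                      # free variables default to 0
    return x

# x1 ^ x2 = 1, x2 ^ x3 = 0, x1 ^ x3 = 1 (consistent; one solution is [1,0,0])
print(solve_gf2([[1, 1, 0], [0, 1, 1], [1, 0, 1]], [1, 0, 1]))
```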