
Showing papers in "Journal of Bioinformatics and Computational Biology in 2007"


Journal ArticleDOI
TL;DR: Several approaches beyond traditional sequence similarity that utilize the overwhelmingly large amounts of available data for computational function prediction are categorized, including structure-, association (genomic context)-, interaction (cellular context)-, process (metabolic context)-, and proteomics-experiment-based methods.
Abstract: Function prediction of uncharacterized protein sequences generated by genome projects has emerged as an important focus for computational biology. We have categorized several approaches beyond traditional sequence similarity that utilize the overwhelmingly large amounts of available data for computational function prediction, including structure-, association (genomic context)-, interaction (cellular context)-, process (metabolic context)-, and proteomics-experiment-based methods. Because they incorporate structural and experimental data that is not used in sequence-based methods, they can provide additional accuracy and reliability to protein function prediction. Here, first we review the definition of protein function. Then the recent developments of these methods are introduced with special focus on the type of predictions that can be made. The need for further development of comprehensive systems biology techniques that can utilize the ever-increasing data presented by the genomics and proteomics communities is emphasized. For the readers' convenience, tables of useful online resources in each category are included. The role of computational scientists in the near future of biological research and the interplay between computational and experimental biology are also addressed.

103 citations


Journal ArticleDOI
TL;DR: Statistical procedures are demonstrated that superimpose expression data onto the transcription regulation network mined from scientific literature and aim at selecting transcription regulators with significant patterns of expression changes downstream.
Abstract: Microarray-based characterization of tissues, cellular and disease states, and environmental condition and treatment responses provides genome-wide snapshots containing large amounts of invaluable information. However, the lack of inherent structure within the data and strong noise make extracting and interpreting this information and formulating and prioritizing domain relevant hypotheses difficult tasks. Integration with different types of biological data is required to place the expression measurements into a biologically meaningful context. A few approaches in microarray data interpretation are discussed with the emphasis on the use of molecular network information. Statistical procedures are demonstrated that superimpose expression data onto the transcription regulation network mined from scientific literature and aim at selecting transcription regulators with significant patterns of expression changes downstream. Tests are suggested that take into account network topology and signs of transcription regulation effects. The approaches are illustrated using two different expression datasets, the performance is compared, and biological relevance of the predictions is discussed.

68 citations
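The superposition idea above can be sketched minimally in Python: score a literature-derived transcription regulator by how consistently the signs of its known downstream effects agree with the observed direction of expression change. The gene names and values are invented, and the bare consistency fraction is only a stand-in for the paper's topology-aware statistical tests.

```python
def regulator_score(effects, log_fc):
    """Fraction of a regulator's measured downstream targets whose
    observed expression change agrees in sign with the
    literature-mined regulatory effect (+1 activation, -1 repression).
    `log_fc` maps genes to signed log fold-changes."""
    consistent = sum(1 for gene, sign in effects.items()
                     if gene in log_fc and sign * log_fc[gene] > 0)
    measured = sum(1 for gene in effects if gene in log_fc)
    return consistent / measured if measured else 0.0

# Hypothetical regulator: activates a and c, represses b.
effects = {"a": 1, "b": -1, "c": 1}
observed = {"a": 2.0, "b": -0.5, "c": -1.0}
regulator_score(effects, observed)  # a and b are consistent, c is not
```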


Journal ArticleDOI
TL;DR: The strategy for approaching the relationship between SNPs and disease and the results of benchmarking the approach are presented; the techniques developed in the laboratory allow fast and automated sequence-structure homology recognition to identify templates and to perform comparative modeling.
Abstract: The prediction of the effects of nonsynonymous single nucleotide polymorphisms (nsSNPs) on function depends critically on exploiting all information available on the three-dimensional structures of...

65 citations


Journal ArticleDOI
TL;DR: The proposed approach approximates kinetic data using additional information on the structural and functional features of molecular genetic systems, and does not require knowledge of their detailed mechanisms.
Abstract: Development of an in silico cell is an urgent task of systems biology. The core of this cell should consist of mathematical models of intracellular events, including enzymatic reactions and control of gene expression. For example, the minimal model of the E. coli cell should include description of about one thousand enzymatic reactions and regulation of expression of approximately the same number of genes. In many cases, detailed mechanisms of molecular processes are not known. In this study, we propose a generalized Hill function method for modeling molecular events. The proposed approach approximates kinetic data using additional information on the structural and functional features of molecular genetic systems, and does not require knowledge of their detailed mechanisms. Generalized Hill function models of an enzymatic reaction catalyzed by the tryptophan-sensitive 3-deoxy-d-arabino-heptulosonate-7-phosphate synthase and of the cydAB operon expression regulation are presented.

47 citations
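As a rough illustration of the flavor of such models, a classical Hill term can be composed multiplicatively with regulatory effector terms. The parameter values and the simple multiplicative composition below are illustrative, not the paper's fitted model.

```python
def hill(x, K, n, activator=True):
    """Classical Hill function: fractional response at effector
    concentration x, with half-saturation constant K and Hill
    (cooperativity) coefficient n."""
    a = x ** n / (K ** n + x ** n)
    return a if activator else 1.0 - a

def generalized_rate(vmax, substrate, Ks, ns, effectors=()):
    """Generalized Hill-type rate law: a maximal rate scaled by a Hill
    term for the substrate and, multiplicatively, one Hill term per
    regulatory effector. `effectors` is a sequence of
    (concentration, K, n, is_activator) tuples."""
    rate = vmax * hill(substrate, Ks, ns)
    for x, K, n, act in effectors:
        rate *= hill(x, K, n, act)
    return rate

# Toy DAHP-synthase-like case: substrate activation plus feedback
# inhibition by tryptophan (all numbers invented).
v = generalized_rate(vmax=10.0, substrate=2.0, Ks=1.0, ns=2,
                     effectors=[(0.5, 1.0, 2, False)])
```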


Journal ArticleDOI
TL;DR: The automatic information retrieval method developed to recover texts mentioning SAPs was found to be efficient in retrieving new references on known polymorphisms; handling nonstandard mutation nomenclature and correcting sequence positions proved necessary to retrieve a significant number of relevant articles.
Abstract: The UniProt/Swiss-Prot Knowledgebase records about 30,500 variants in 5,664 proteins (Release 52.2). Most of these variants are manually curated single amino acid polymorphisms (SAPs) with references to the literature. In order to keep the list of published documents related to SAPs up to date, an automatic information retrieval method is developed to recover texts mentioning SAPs. The method is based on the use of regular expressions (patterns) and rules for the detection and validation of mutations. When evaluated using a corpus of 9,820 PubMed references, the precision of the retrieval was determined to be 89.5% over all variants. It was also found that the use of nonstandard mutation nomenclature and sequence positional correction is necessary to retrieve a significant number of relevant articles. The method was applied to the 5,664 proteins with variants. This was performed by first submitting a PubMed query to retrieve articles using gene or protein names and a list of mutation-related keywords; the SAP detection procedure was then used to recover relevant documents. The method was found to be efficient in retrieving new references on known polymorphisms. New references on known SAPs will be rendered accessible to the public via the Swiss-Prot variant pages.

39 citations
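The pattern-based detection step can be illustrated with a toy regular expression that catches mentions like "T315I" or "Glu6Val". This is far simpler than the paper's validated rule set and omits the positional-correction step entirely.

```python
import re

# Amino acid codes; the real system uses a much richer pattern set.
AA1 = "ACDEFGHIKLMNPQRSTVWY"
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
       "Phe|Pro|Ser|Thr|Trp|Tyr|Val")
SAP_PATTERN = re.compile(
    rf"\b(?:(?P<wt3>{AA3})|(?P<wt1>[{AA1}]))"   # wild-type residue
    rf"(?P<pos>\d+)"                             # sequence position
    rf"(?:(?P<mt3>{AA3})|(?P<mt1>[{AA1}]))\b")   # mutant residue

def find_saps(text):
    """Return (wild-type, position, mutant) triples mentioned in text."""
    hits = []
    for m in SAP_PATTERN.finditer(text):
        wt = m.group("wt3") or m.group("wt1")
        mt = m.group("mt3") or m.group("mt1")
        hits.append((wt, int(m.group("pos")), mt))
    return hits

# Both one-letter (T315I) and three-letter (Glu6Val) mentions match.
find_saps("The T315I and Glu6Val substitutions were analysed.")
```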


Journal ArticleDOI
TL;DR: A new method for identifying these common tumor progression pathways by applying phylogeny inference algorithms to single-cell assays, taking advantage of information on tumor heterogeneity lost to prior microarray-based approaches is developed.
Abstract: Studies of gene expression in cancerous tumors have revealed that tumors presenting indistinguishable symptoms in the clinic can be substantially different entities at the molecular level. The ability to distinguish between these genetically distinct cancers will make possible more accurate prognoses and more finely targeted therapeutics, provided we can characterize commonly occurring cancer sub-types and the specific molecular abnormalities that produce them. We develop a new method for identifying these common tumor progression pathways by applying phylogeny inference algorithms to single-cell assays, taking advantage of information on tumor heterogeneity lost to prior microarray-based approaches. We combine this approach with expectation maximization to infer unknown parameters used in the phylogeny construction. We further develop new algorithms to merge inferred trees across different assays. We validate the expectation maximization method on simulated data and demonstrate the combined approach on a set of fluorescent in situ hybridization (FISH) data measuring cell-by-cell gene and chromosome copy numbers in a large sample of breast cancers. The results further validate the proposed computational methods by showing consistency with several previous findings on these cancers and provide novel insights into the mechanisms of tumor progression in these patients.

39 citations


Journal ArticleDOI
TL;DR: Simulation studies and analysis of biological data confirm the conjecture that the N-statistic is a much better choice for multivariate significance testing within the framework of the GSEA.
Abstract: A test-statistic typically employed in the gene set enrichment analysis (GSEA) prevents this method from being genuinely multivariate. In particular, this statistic is insensitive to changes in the correlation structure of the gene sets of interest. The present paper considers the utility of an alternative test-statistic in designing the confirmatory component of the GSEA. This statistic is based on a pertinent distance between joint distributions of expression levels of genes included in the set of interest. The null distribution of the proposed test-statistic, known as the multivariate N-statistic, is obtained by permuting group labels. Our simulation studies and analysis of biological data confirm the conjecture that the N-statistic is a much better choice for multivariate significance testing within the framework of the GSEA. We also discuss some other aspects of the GSEA paradigm and suggest new avenues for future research.

38 citations
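A minimal sketch of the N-statistic with a label-permutation null is given below, using the energy-distance form with the Euclidean norm as the kernel; the paper's implementation and kernel details may differ.

```python
import random

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def n_statistic(xs, ys):
    """Multivariate N-statistic (energy-distance form) between two
    samples of expression vectors: 2*E|X-Y| - E|X-X'| - E|Y-Y'|."""
    n, m = len(xs), len(ys)
    between = sum(euclid(x, y) for x in xs for y in ys) / (n * m)
    within_x = sum(euclid(a, b) for a in xs for b in xs) / (n * n)
    within_y = sum(euclid(a, b) for a in ys for b in ys) / (m * m)
    return 2 * between - within_x - within_y

def permutation_pvalue(xs, ys, n_perm=999, seed=0):
    """Null distribution obtained by permuting group labels."""
    rng = random.Random(seed)
    observed = n_statistic(xs, ys)
    pooled = xs + ys
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if n_statistic(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```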


Journal ArticleDOI
TL;DR: An automated system, MuGeX, is described that extracts mutation-gene pairs from Medline abstracts for a disease query; it overcomes the problems of information retrieval from public resources and reduces the time required to access relevant information while preserving the accuracy of the retrieved information.
Abstract: To have a better understanding of the mechanisms of disease development, knowledge of mutations and the genes on which the mutations occur is of crucial importance. Information on disease-related mutations can be accessed through public databases or biomedical literature sources. However, information retrieval from such resources can be problematic because of two reasons: manually created databases are usually incomplete and not up to date, and reading through a vast amount of publicly available biomedical documents is very time-consuming. In this paper, we describe an automated system, MuGeX (Mutation Gene eXtractor), that automatically extracts mutation-gene pairs from Medline abstracts for a disease query. Our system is tested on a corpus that consists of 231 Medline abstracts. While recall for mutation detection alone is 85.9%, precision is 95.9%. For extraction of mutation-gene pairs, we focus on Alzheimer's disease. The recall for mutation-gene pair identification is estimated at 91.3%, and precision is estimated at 88.9%. With automatic extraction techniques, MuGeX overcomes the problems of information retrieval from public resources and reduces the time required to access relevant information, while preserving the accuracy of retrieved information.

36 citations


Journal ArticleDOI
TL;DR: This work proposes a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data, and builds an optimized C++ implementation of the two-pass DTWimpute algorithm.
Abstract: Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms, in particular for datasets with a higher level of missing entries, the neighborhood-wise and the position-wise algorithms. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former algorithm has appeared superior to the latter one. Motivated by these findings, indicating clearly the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. 
The software also provides for a choice between three different initial rough imputation methods.

30 citations
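The core DTW recurrence that DTWimpute builds on can be sketched as a textbook dynamic-programming formulation; the published tool adds candidate selection, weighting, and the two-pass refinement.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two expression time
    series, via the standard O(len(a)*len(b)) DP recurrence."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

def nearest_profiles(target, candidates, k=3):
    """Select the k candidate profiles most similar to the target
    under DTW; their values at the missing positions could then be
    combined (a simplification of the published imputation schemes)."""
    return sorted(candidates, key=lambda c: dtw_distance(target, c))[:k]
```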


Journal ArticleDOI
TL;DR: A comprehensive framework for the systematic analysis of mutation extraction systems, precisely defining tasks and corresponding evaluation metrics, that will allow a comparison of existing and future applications.
Abstract: The development of text analysis systems targeting the extraction of information about mutations from research publications is an emergent topic in biomedical research. Current systems differ in both scope and approach, thus preventing a meaningful comparison of their performance and therefore possible synergies. To overcome this evaluation bottleneck, we developed a comprehensive framework for the systematic analysis of mutation extraction systems, precisely defining tasks and corresponding evaluation metrics, that will allow a comparison of existing and future applications.

29 citations


Journal ArticleDOI
TL;DR: A new method based on model-checking techniques and symbolic execution to extract constraints on parameters leading to dynamics coherent with known behaviors is introduced.
Abstract: Understanding the functioning of genetic regulatory networks requires modeling biological processes in order to simulate behaviors and to reason on the model. Unfortunately, the modeling task is confronted with incomplete knowledge about the system. To deal with this problem, we propose a methodology that uses the qualitative approach developed by Thomas. A symbolic transition system can represent the set of all possible models in a concise and symbolic way. We introduce a new method based on model-checking techniques and symbolic execution to extract constraints on parameters leading to dynamics coherent with known behaviors. Our method allows us to efficiently respond to two kinds of questions: is there any model coherent with a certain hypothetical behavior? Are there behaviors common to all selected models? The first question is illustrated with the example of mucus production in Pseudomonas aeruginosa, while the second is illustrated with the example of immunity control in bacteriophage lambda.

Journal ArticleDOI
TL;DR: A workable system that can facilitate automation of the workflow for the retrieval, extraction, processing, and visualization of mutation annotations -- tasks which are well known to be tedious, time-consuming, complex, and error-prone is demonstrated.
Abstract: Rich information on point mutation studies is scattered across heterogeneous data sources. This paper presents an automated workflow for mining mutation annotations from full-text biomedical litera...

Journal ArticleDOI
Jake Y. Chen1, Zhong Yan1, Changyu Shen1, Dawn P. G. Fitzpatrick1, Mu Wang1 
TL;DR: It is shown that molecular regulation of cell differentiation and development, caused by responses to proteome-wide stress, is a key signature of the acquired drug resistance.
Abstract: Cisplatin-induced drug resistance is known to involve a complex set of cellular changes whose molecular mechanism details remain unclear. In this study, we developed a systems biology approach to examine proteomics- and network-level changes between cisplatin-resistant and cisplatin-sensitive cell lines. This approach involves experimental investigation of differential proteomics profiles and computational study of activated enriched proteins, protein interactions, and protein interaction networks. Our experimental platform is based on label-free liquid chromatography/mass spectrometry proteomics. Our computational methods start with an initial list of 119 differentially expressed proteins. We expanded these proteins into a cisplatin-resistant activated sub-network using a database of human protein-protein interactions. An examination of network topology features revealed the activated responses in the network are closely coupled. By examining sub-network proteins using gene ontology categories, we found significant enrichment of proton-transporting ATPase and ATP synthase complexes activities in cisplatin-resistant cells in the form of cooperative down-regulations. Using two-dimensional visualization matrixes, we further found significant cascading of endogenous, abiotic, and stress-related signals. Using a visual representation of activated protein categorical sub-networks, we showed that molecular regulation of cell differentiation and development, caused by responses to proteome-wide stress, is a key signature of the acquired drug resistance.

Journal ArticleDOI
TL;DR: This paper describes and implements a new algorithm, the Stochastic EM-type Algorithm for Motif-finding (SEAM), and redesigns and implements the EM-based motif-finding algorithm called deterministic EM (DEM) for comparison with SEAM, its stochastic counterpart.
Abstract: Position weight matrix-based statistical modeling for the identification and characterization of motif sites in a set of unaligned biopolymer sequences is presented. This paper describes and implements a new algorithm, the Stochastic EM-type Algorithm for Motif-finding (SEAM), and redesigns and implements the EM-based motif-finding algorithm called deterministic EM (DEM) for comparison with SEAM, its stochastic counterpart. The gold standard example, cyclic adenosine monophosphate receptor protein (CRP) binding sequences, together with other biological sequences, is used to illustrate the performance of the new algorithm and compare it with other popular motif-finding programs. The convergence of the new algorithm is shown by simulation. The in silico experiments using simulated and biological examples illustrate the power and robustness of the new algorithm SEAM in de novo motif discovery.
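The scoring step underlying such EM-type motif finders can be illustrated with a deterministic log-odds scan over candidate sites; SEAM's actual procedure is stochastic and re-estimates the weight matrix at each iteration, and the matrix below is invented.

```python
import math

def pwm_score(pwm, window, background=0.25):
    """Log-odds score of a sequence window under a position weight
    matrix (pwm[i][base] = probability of `base` at motif position i)
    versus a uniform background."""
    return sum(math.log(pwm[i][b] / background)
               for i, b in enumerate(window))

def best_site(pwm, seq):
    """Return (offset, score) of the best-scoring motif site in seq,
    i.e. the maximization step of a deterministic EM-style search."""
    w = len(pwm)
    return max(((i, pwm_score(pwm, seq[i:i + w]))
                for i in range(len(seq) - w + 1)),
               key=lambda t: t[1])

# Toy two-column motif strongly preferring "AC".
example_pwm = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
               {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01}]
best_site(example_pwm, "GGACGG")  # the "AC" window at offset 2 wins
```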

Journal ArticleDOI
TL;DR: The study indicates that the discrimination ability of the stem kernel is strong compared with conventional methods, and the potential application is demonstrated by the detection of remotely homologous RNA families in terms of secondary structures.
Abstract: Several computational methods based on stochastic context-free grammars have been developed for modeling and analyzing functional RNA sequences. These grammatical methods have succeeded in modeling typical secondary structures of RNA, and are used for structural alignment of RNA sequences. However, such stochastic models cannot sufficiently discriminate member sequences of an RNA family from nonmembers and hence detect noncoding RNA regions from genome sequences. A novel kernel function, stem kernel, for the discrimination and detection of functional RNA sequences using support vector machines (SVMs) is proposed. The stem kernel is a natural extension of the string kernel, specifically the all-subsequences kernel, and is tailored to measure the similarity of two RNA sequences from the viewpoint of secondary structures. The stem kernel examines all possible common base pairs and stem structures of arbitrary lengths, including pseudoknots between two RNA sequences, and calculates the inner product of common stem structure counts. An efficient algorithm is developed to calculate the stem kernels based on dynamic programming. The stem kernels are then applied to discriminate members of an RNA family from nonmembers using SVMs. The study indicates that the discrimination ability of the stem kernel is strong compared with conventional methods. Furthermore, the potential application of the stem kernel is demonstrated by the detection of remotely homologous RNA families in terms of secondary structures. This is because the string kernel is proven to work for the remote homology detection of protein sequences. These experimental results have convinced us to apply the stem kernel in order to find novel RNA families from genome sequences.

Journal ArticleDOI
TL;DR: The study was successful in listing out potential drug targets from the S. pneumoniae proteome involved in vital aspects of the pathogen's metabolism, persistence, virulence and cell wall biosynthesis, and can be extended to other pathogens of clinical interest.
Abstract: The emergence of multidrug resistant varieties of Streptococcus pneumoniae (S. pneumoniae) has led to a search for novel drug targets. An in silico comparative analysis of the metabolic pathways of the host Homo sapiens (H. sapiens) and the pathogen S. pneumoniae has been performed. Enzymes from the biochemical pathways of S. pneumoniae from the KEGG metabolic pathway database were compared with proteins from the host H. sapiens, by performing a BLASTp search against the non-redundant database restricted to the H. sapiens subset. The e-value threshold cutoff was set to 0.005. Enzymes which do not show similarity to any of the host proteins below this threshold were filtered out as potential drug targets. Five pathways unique to the pathogen S. pneumoniae when compared to the host H. sapiens have been identified. Potential drug targets from these pathways could be useful for the discovery of broad-spectrum drugs. Potential drug targets were also identified from pathways related to lipid metabolism, carbohydrate metabolism, amino acid metabolism, energy metabolism, vitamin and cofactor biosynthetic pathways, and nucleotide metabolism. Of the 161 distinct targets identified from these pathways, many are in various stages of progress at the Microbial Genome Database. However, 44 of the targets are new and can be considered for rational drug design. The study was successful in listing potential drug targets from the S. pneumoniae proteome involved in vital aspects of the pathogen's metabolism, persistence, virulence, and cell wall biosynthesis. This systematic evaluation of the metabolic pathways of host and pathogen through a reliable and conventional bioinformatics approach can be extended to other pathogens of clinical interest.
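Once BLASTp results are parsed, the non-homology filter described above reduces to a simple predicate. The enzyme names and e-values below are invented, and parsing of the actual BLAST report is omitted.

```python
def candidate_targets(pathogen_hits, evalue_cutoff=0.005):
    """Keep pathogen enzymes with NO hit to any human protein below
    the e-value cutoff, mirroring the non-homology criterion.
    `pathogen_hits` maps each enzyme to the best e-value of its
    BLASTp match against the human subset (None if no hit at all)."""
    return [enzyme for enzyme, e in pathogen_hits.items()
            if e is None or e >= evalue_cutoff]

hits = {"murA": None,    # no human homolog detected
        "folP": 1e-2,    # similarity too weak to count
        "gapA": 1e-80}   # clear human homolog, excluded
candidate_targets(hits)  # -> ['murA', 'folP']
```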

Journal ArticleDOI
TL;DR: The recent developments of available software methods to analyze SNPs in relation to complex diseases are reviewed with emphasis on the type of predictions on protein structure and functions that can be made.
Abstract: Bioinformatics is the use of informatics tools and techniques in the study of molecular biology, genetic, or clinical data. The field of bioinformatics has expanded tremendously to cope with the large expansion of information generated by the mouse and human genome projects, as newer generations of computers that are much more powerful have emerged in the commercial market. It is now possible to employ the computing hardware and software at hand to generate novel methodologies in order to link data across the different databanks generated by these international projects and derive clinical and biological relevance from all of the information gathered. The ultimate goal would be to develop a computer program that can provide information correlating genes, their single nucleotide polymorphisms (SNPs), and the possible structural and functional effects on the encoded proteins with relation to known information on complex diseases with great ease and speed. Here, the recent developments of available software methods to analyze SNPs in relation to complex diseases are reviewed with emphasis on the type of predictions on protein structure and functions that can be made. The need for further development of comprehensive bioinformatics tools that can cope with information generated by the genomics communities is emphasized.

Journal ArticleDOI
TL;DR: This report shows interactive links between virtual and experimental approaches in a total pipeline "from gene to drug", using Surface Plasmon Resonance technology to experimentally assess PPI in HIV-1 protease and bacterial L-asparaginase.
Abstract: Protein–protein and protein–ligand interactions play a central role in biochemical reactions, and understanding these processes is an important task in different fields of biomedical science and drug discovery. Proteins often work in complex assemblies of several macromolecules and small ligands. The structural and functional description of protein–protein interactions (PPI) is very important for basic-, as well as applied research. The interface areas of protein complexes have unique structure and properties, so PPI represent prospective targets for a new generation of drugs. One of the key targets of PPI inhibitors are oligomeric enzymes. This report shows interactive links between virtual and experimental approaches in a total pipeline "from gene to drug" and using Surface Plasmon Resonance technology for experimentally assessing PPI. Our research is conducted on two oligomeric enzymes — HIV-1 protease (HIVp) (homo-dimer) and bacterial L-asparaginase (homo-tetramer). Using methods of molecular modeling and computational alanine scanning we obtained structural and functional description of PPI in these two enzymes. We also presented a real example of application of integral approach in searching inhibitors of HIVp dimerization — from virtual database mining up to experimental testing of lead compounds.

Journal ArticleDOI
TL;DR: Experimental evidence is provided that two internal promoters are recognized by bacterial RNA polymerase: one is located within the hns coding sequence and may initiate synthesis of RNA from the antisense strand.
Abstract: Mapping of putative promoters within the entire genome of Escherichia coli (E. coli) by means of the pattern-recognition software PlatProm revealed several thousand sites having a high probability of performing promoter function. Along with the expected promoters located upstream of coding sequences, PlatProm identified more than a thousand potential promoters for antisense transcription and several hundred very similar signals within coding sequences having the same direction as the genes. Since the recently developed ChIP–chip technology has also confirmed the presence of intragenic RNA polymerase binding sites, such a distribution of putative promoters is likely to be a general biological phenomenon reflecting yet undiscovered regulatory events. Here, we provide experimental evidence that two internal promoters are recognized by bacterial RNA polymerase. One of them is located within the hns coding sequence and may initiate synthesis of RNA from the antisense strand. The other is found within the overlapping genes htgA/yaaW and may control the production of a shortened mRNA or an RNA product complementary to the mRNA of yaaW. Both RNA products can form secondary structures with free energies of folding close to those of small regulatory RNAs (sRNAs) of the same length. The folding propensity of known sRNAs was further compared with that of antisense RNAs (aRNAs) predicted in E. coli as well as in Salmonella typhimurium (S. typhimurium). The slightly lower stability observed for aRNAs suggests that their structural compactness may be less significant for biological function.

Journal ArticleDOI
TL;DR: C-spheres can be used to accelerate point-based geometric and chemical comparison algorithms, maintaining accuracy while reducing runtime, and it is demonstrated that the placement of C-spheres can significantly affect the number of TPs and FPs identified by a cavity-aware motif.
Abstract: Algorithms for geometric and chemical comparison of protein substructure can be useful for many applications in protein function prediction. These motif matching algorithms identify matches of geometric and chemical similarity between well-studied functional sites, motifs, and substructures of functionally uncharacterized proteins, targets. For the purpose of function prediction, the accuracy of motif matching algorithms can be evaluated with the number of statistically significant matches to functionally related proteins, true positives (TPs), and the number of statistically insignificant matches to functionally unrelated proteins, false positives (FPs). Our earlier work developed cavity-aware motifs which use motif points to represent functionally significant atoms and C-spheres to represent functionally significant volumes. We observed that cavity-aware motifs match significantly fewer FPs than matches containing only motif points. We also observed that high-impact C-spheres, which significantly contribute to the reduction of FPs, can be isolated automatically with a technique we call Cavity Scaling. This paper extends our earlier work by demonstrating that C-spheres can be used to accelerate point-based geometric and chemical comparison algorithms, maintaining accuracy while reducing runtime. We also demonstrate that the placement of C-spheres can significantly affect the number of TPs and FPs identified by a cavity-aware motif. While the optimal placement of C-spheres remains a difficult open problem, we compared two logical placement strategies to better understand C-sphere placement.
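The cavity constraint itself is simple to state in code: a match is rejected when any target atom occupies volume a C-sphere requires to be empty. This is a geometric sketch only; real cavity-aware matching also aligns motif points and computes statistical significance, and the coordinates below are invented.

```python
def dist(p, q):
    """Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def violates_cavity(c_spheres, target_atoms):
    """True if any target atom falls inside a C-sphere, i.e. occupies
    volume the cavity-aware motif requires to be empty. Spheres are
    (center, radius) pairs; atoms are 3D points."""
    return any(dist(center, atom) < radius
               for center, radius in c_spheres
               for atom in target_atoms)

spheres = [((0.0, 0.0, 0.0), 2.0)]
violates_cavity(spheres, [(5.0, 0.0, 0.0)])  # atom outside: match kept
violates_cavity(spheres, [(1.0, 0.0, 0.0)])  # atom inside: match rejected
```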

Journal ArticleDOI
TL;DR: The ProMiner system developed for the recognition and normalization of gene and protein names with a conditional random field (CRF)-based recognition of variation terms in biomedical text is integrated and linked to Single Nucleotide Polymorphism database (dbSNP) entries.
Abstract: The influence of genetic variations on diseases or cellular processes is the main focus of many investigations, and results of biomedical studies are often only accessible through scientific publications. Automatic extraction of this information requires recognition of the gene names and the accompanying allelic variant information. In a previous work, the OSIRIS system for the detection of allelic variation in text based on a query expansion approach was communicated. Challenges associated with this system are the relatively low recall for variation mentions and gene name recognition. To tackle this challenge, we integrate the ProMiner system developed for the recognition and normalization of gene and protein names with a conditional random field (CRF)-based recognition of variation terms in biomedical text. Following the newly developed normalization of variation entities, we can link textual entities to Single Nucleotide Polymorphism database (dbSNP) entries. The performance of this novel approach is evaluated, and improved results in comparison to state-of-the-art systems are reported.

Journal ArticleDOI
TL;DR: Two previous quasi-equilibrium models of transcriptional regulation, derived using the methods of equilibrium statistical mechanics, are rederive and extended, and circumstances under which they can be approximated at each transcription complex by feed-forward artificial neural network (ANN) models are demonstrated.
Abstract: Mechanistic models for transcriptional regulation are derived using the methods of equilibrium statistical mechanics, to model equilibrating processes that occur at a fast time scale. These processes regulate slower changes in the synthesis and expression of transcription factors that feed back and cooperatively regulate transcription, forming a gene regulation network (GRN). We rederive and extend two previous quasi-equilibrium models of transcriptional regulation, and demonstrate circumstances under which they can be approximated at each transcription complex by feed-forward artificial neural network (ANN) models. A single-level mechanistic model can be approximated by a successfully applied phenomenological model of GRNs which is based on single-layer analog-valued ANNs. A two-level hierarchical mechanistic model, with separate activation states for modules and for the whole transcription complex, can be approximated by a two-layer feed-forward ANN in several related ways. The sufficient conditions demonstrated for the ANN approximations correspond biologically to large numbers of binding sites each of which have a small effect. A further extension to the single-level and two-level models allows one-dimensional chains of overlapping and/or energetically interacting binding sites within a module. Partition functions for these models can be constructed from stylized diagrams that indicate energetic and logical interactions between binary-valued state variables. All parameters in the mechanistic models, including the two approximations, can in principle be related to experimentally measurable free energy differences, among other observables.
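The single-layer analog-valued ANN approximation mentioned above can be sketched as a synchronous update rule in which each gene's next expression level is a sigmoid of a weighted sum of current levels. The weight matrix T and threshold vector h here are illustrative stand-ins for the free-energy-derived parameters of the mechanistic model, not values from the paper.

```python
import math

def sigmoid(u):
    """Monotone sigmoidal activation g(u) with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def grn_step(v, T, h):
    """One synchronous update of a single-layer analog-valued ANN GRN:
    v_i <- g(sum_j T[i][j] * v[j] + h[i]).
    T[i][j] > 0 means gene j activates gene i; T[i][j] < 0 means repression."""
    return [sigmoid(sum(T[i][j] * v[j] for j in range(len(v))) + h[i])
            for i in range(len(v))]
```

For example, with `T = [[0, 2], [-2, 0]]`, gene 1 activates gene 0 while gene 0 represses gene 1, so one step drives the two expression levels apart.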

Journal ArticleDOI
TL;DR: In this paper, a cross-platform normalization method was used to make these data comparable, and three discriminative gene lists between the large vessel and the microvascular endothelial cells were obtained by SAM (significance analysis of microarrays), PAM (prediction analysis for microarrays), and a combination of the SAM and PAM lists.
Abstract: Since the available microarray data of BOEC (human blood outgrowth endothelial cells), large vessel, and microvascular endothelial cells were from two different platforms, a working cross-platform normalization method was needed to make these data comparable. With six HUVEC (human umbilical vein endothelial cell) samples hybridized on two-channel cDNA arrays and six HUVEC samples on Affymetrix arrays, 64 possible combinations of a three-step normalization procedure were investigated to search for the best normalization method, which was selected based on two criteria measuring the extent to which expression profiles of biological samples of the same cell type arrayed on the two platforms were indistinguishable. Next, three discriminative gene lists between the large vessel and the microvascular endothelial cells were obtained by SAM (significance analysis of microarrays), PAM (prediction analysis for microarrays), and a combination of the SAM and PAM lists. The final discriminative gene list was selected by SVM (support vector machine). Based on this discriminative gene list, SVM classification analysis with the best tuning parameters and 10,000 validation runs showed that BOEC were far from large vessel cells: they either formed their own class or fell into the microvascular class. Based on all the common genes between the two platforms, SVM analysis further confirmed this conclusion.
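Cross-platform normalization of the kind searched for in the abstract can be built from steps such as rank-based (quantile) normalization, which forces each sample's value distribution onto a common reference. The sketch below is one plausible building block, not the specific three-step procedure the authors evaluated; ties are broken arbitrarily.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equal-length expression vectors:
    each sample's sorted values are replaced by the across-sample mean
    of values at the same rank, making all distributions identical."""
    n = len(samples[0])
    sorted_cols = [sorted(s) for s in samples]
    # Reference distribution: mean of the i-th smallest value across samples.
    ref = [sum(col[i] for col in sorted_cols) / len(samples) for i in range(n)]
    out = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = ref[rank]
        out.append(normalized)
    return out
```

After this step, two samples that agree on the ranking of genes become numerically identical, which is one way to make profiles from different platforms comparable.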

Journal ArticleDOI
TL;DR: An approach for automatically developing collections of regular expressions to drive high-performance concept recognition systems with minimal human interaction is presented, and MutationFinder, a system for automatically extracting mentions of point mutations from text, is developed.
Abstract: The primary biomedical literature is being generated at an unprecedented rate, and researchers cannot keep abreast of new developments in their fields. Biomedical natural language processing is being developed to address this issue, but building reliable systems often requires many expert-hours. We present an approach for automatically developing collections of regular expressions to drive high-performance concept recognition systems with minimal human interaction. We applied our approach to develop MutationFinder, a system for automatically extracting mentions of point mutations from text. MutationFinder achieves performance equivalent to or better than manually developed mutation recognition systems, but the generation of its 759 patterns required only 5.5 expert-hours. We also discuss the development and evaluation of our recently published high-quality, human-annotated gold standard corpus, which contains 1,515 complete point mutation mentions annotated in 813 abstracts. Both MutationFinder and the complete corpus are publicly available at .
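A miniature version of the regular-expression approach can be sketched as follows. The two hand-written patterns below (for mentions like "Ala123Thr" and "A123T") and the normalization to one-letter triples are illustrative only; MutationFinder's actual 759 patterns were generated automatically, not written by hand like these.

```python
import re

# Three-letter amino acid codes and their one-letter equivalents.
AA3 = "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
ONE_LETTER = {"Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
              "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
              "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
              "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V"}

PATTERNS = [
    re.compile(r"\b(%s)(\d+)(%s)\b" % (AA3, AA3)),   # e.g. Ala123Thr
    re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b"),  # e.g. A123T
]

def find_point_mutations(text):
    """Return normalized (wild-type, position, mutant) triples, e.g. ('A', '123', 'T')."""
    mentions = set()
    for pat in PATTERNS:
        for wt, pos, mut in pat.findall(text):
            mentions.add((ONE_LETTER.get(wt, wt), pos, ONE_LETTER.get(mut, mut)))
    return mentions
```

Normalizing both surface forms to the same triple is what lets distinct mentions of the same mutation be counted once.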

Journal ArticleDOI
TL;DR: A website to integrate phenotype-related information using an experimental-evidence-based approach, two types of integrated viewers to enhance the accessibility to mutant resource information, and an ontology associating international phenotypic definitions with experimental terminologies.
Abstract: Recently, a number of collaborative large-scale mouse mutagenesis programs have been launched. These programs aim for a better understanding of the roles of all individual coding genes and the biological systems in which these genes participate. In international efforts to share phenotypic data among facilities/institutes, it is desirable to integrate information obtained from different phenotypic platforms reliably. Since the definitions of specific phenotypes often depend on a tacit understanding of concepts that tends to vary among different facilities, it is necessary to define phenotypes based on the explicit evidence of assay results. We have developed a website termed PhenoSITE (Phenome Semantics Information with Terminology of Experiments: http://www.gsc.riken.jp/Mouse/), in which we are trying to integrate phenotype-related information using an experimental-evidence-based approach. The site's features include (1) a baseline database for our phenotyping platform; (2) an ontology associating international phenotypic definitions with experimental terminologies used in our phenotyping platform; (3) a database for standardized operation procedures of the phenotyping platform; and (4) a database for mouse mutants using data produced from the large-scale mutagenesis program at RIKEN GSC. We have developed two types of integrated viewers to enhance the accessibility to mutant resource information. One viewer depicts a matrix view of the ontology-based classification and chromosomal location of each gene; the other depicts ontology-mediated integration of experimental protocols, baseline data, and mutant information. These approaches rely entirely upon experiment-based evidence, ensuring the reliability of the integrated data from different phenotyping platforms.

Journal ArticleDOI
TL;DR: The originally developed software realizing the proposed model offers functionality to fully model RNA secondary-structure folding kinetics, and the estimate of the correlation between the premature transcription termination probability p and the concentration c of charged aminoacyl-tRNA was obtained as a function p(c) for many regulatory regions in many bacterial genomes, as well as for local mutations in these regions.
Abstract: A model is proposed, primarily for the classical RNA attenuation regulation of gene expression through premature transcription termination. The model is based on the concept of the RNA secondary structure macrostate within the regulatory region between the ribosome and the RNA polymerase, on a hypothetical equation describing deceleration of the RNA polymerase by a macrostate, and on a description of transcription and translation initiation and elongation under varied values of the four basic model parameters. A special effort was made to select adequate model parameters. We first discuss the kinetics of RNA folding and define the concept of the macrostate as a specific parentheses structure used to construct a conventional set of hairpins. The originally developed software that realizes the proposed model offers functionality to fully model RNA secondary-structure folding kinetics. Its performance is compared to that of a public server described in Ref. 1. We then describe the delay in the RNA polymerase shifting to the next base, or its premature termination, caused by an RNA secondary structure or, in our terms, a macrostate. Essential to this description are the basic and excited states of the polymerase first introduced in Ref. 2: the polymerase can shift to the next base only in the basic state, and can detach from the DNA strand only in the excited state. To the authors' knowledge, no model incorporating the above-mentioned attenuation characteristics has been published elsewhere. The model was implemented in an application with a command-line interface for running in batch mode in Windows and Linux environments, as well as a public web server (Ref. 3). The model was tested with a conventional Monte Carlo procedure. In these simulations, the estimate of the correlation between the premature transcription termination probability p and the concentration c of charged aminoacyl-tRNA was obtained as a function p(c) for many regulatory regions in many bacterial genomes, as well as for local mutations in these regions.

Journal ArticleDOI
TL;DR: With m fragments whose maximum length is k1, n SNP sites, and the number of fragments covering a SNP site no more than k2, the algorithms can solve the gapless MSR (Minimum SNP Removal) and MFR (Minimum Fragment Removal) problems in time O(nk1k2 + m log m + nk2 + mk1) and O(mk2^2 …
Abstract: The individual haplotyping problem is the computational problem of reconstructing the two haplotypes of an individual, based on several optimality criteria, from the individual's fragment sequencing data. Based on the fact that the length of a fragment and the number of fragments covering a SNP (single nucleotide polymorphism) site are both very small compared with the length of a sequenced region and the total number of fragments, this paper introduces parameterized haplotyping problems. With m fragments whose maximum length is k1, n SNP sites, and the number of fragments covering a SNP site no more than k2, our algorithms can solve the gapless MSR (Minimum SNP Removal) and MFR (Minimum Fragment Removal) problems in time O(nk1k2 + m log m + nk2 + mk1) and O(mk2^2 …), respectively. Since the values of k1 and k2 are both small (about 10) in practice, our algorithms are more efficient and applicable than the algorithms of V. Bafna et al., whose time complexities are O(mn^2) and O(m^2n + m^3), respectively.
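The feasibility condition underlying MFR can be sketched directly: two fragments that disagree at a shared SNP site cannot come from the same haplotype, and an error-free fragment set is consistent with exactly two haplotypes precisely when this conflict graph is bipartite. The sketch below checks that condition by 2-coloring; it is not the parameterized algorithm of the paper, and the quadratic pairwise conflict test is for clarity only.

```python
from collections import deque

def fragments_conflict(f, g):
    """Two fragments conflict if they disagree at any SNP site both cover.
    Fragments are dicts mapping SNP index -> allele ('0' or '1')."""
    return any(f[s] != g[s] for s in f.keys() & g.keys())

def bipartition_fragments(fragments):
    """Try to 2-color the conflict graph; return the partition (two index
    lists) if the fragments are consistent with two haplotypes, else None."""
    n = len(fragments)
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if v != u and fragments_conflict(fragments[u], fragments[v]):
                    if color[v] is None:
                        color[v] = 1 - color[u]
                        queue.append(v)
                    elif color[v] == color[u]:
                        return None  # odd cycle: no 2-haplotype assignment
    return ([i for i in range(n) if color[i] == 0],
            [i for i in range(n) if color[i] == 1])
```

MFR can then be read as: remove the fewest fragments so that the remaining conflict graph becomes bipartite.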

Journal ArticleDOI
TL;DR: The analysis indicates a significant effect of normalization and pre-clustering methods on the clustering results, which has significance in fine-tuning the EP_GOS_Clust clustering approach.
Abstract: We study the effects of different normalization and pre-clustering techniques on clustering quality for a novel mixed-integer nonlinear optimization-based clustering algorithm, the Global Optimum Search with Enhanced Positioning (EP_GOS_Clust). These are important issues to address: DNA microarray experiments are informative tools for elucidating gene regulatory networks, but for gene expression levels to be comparable across microarrays, normalization procedures have to be properly undertaken. The aim of pre-clustering is to use an adequate amount of discriminatory characteristics to form rough information profiles, so that data with similar features can be pre-grouped together and outliers deemed insignificant to the clustering process can be removed. Using experimental DNA microarray data from the yeast Saccharomyces cerevisiae, we study the merits of pre-clustering genes based on distance/correlation comparisons and symbolic representations such as {+, o, -}. As a performance metric, we look at the intra- and inter-cluster error sums, two generic but intuitive measures of clustering quality. We also use publicly available Gene Ontology resources to assess the clusters' level of biological coherence. Our analysis indicates a significant effect of normalization and pre-clustering methods on the clustering results. Hence, the outcome of this study has significance in fine-tuning the EP_GOS_Clust clustering approach.
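The intra- and inter-cluster error sums used as the performance metric can be computed as in the sketch below, assuming a common formulation: intra-cluster error as the summed squared distance of each point to its cluster centroid, and inter-cluster error as the summed squared distance between centroid pairs. The paper's exact definitions may differ in scaling or normalization.

```python
def cluster_error_sums(points, labels):
    """Return (intra, inter) error sums for a labeled clustering.
    Lower intra-cluster and higher inter-cluster error indicate
    tighter, better-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    # Centroid of each cluster: per-dimension mean of its points.
    centroids = {l: [sum(xs) / len(pts) for xs in zip(*pts)]
                 for l, pts in clusters.items()}
    sq = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    intra = sum(sq(p, centroids[l]) for p, l in zip(points, labels))
    labs = sorted(centroids)
    inter = sum(sq(centroids[a], centroids[b])
                for i, a in enumerate(labs) for b in labs[i + 1:])
    return intra, inter
```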

Journal ArticleDOI
TL;DR: A data mining technique that makes use of a probabilistic inference approach to uncover interesting dependency relationships in noisy, high-dimensional time series expression data and can reveal gene regulatory relationships that could be used to infer the structures of GRNs.
Abstract: Recent developments in DNA microarray technologies have made the reconstruction of gene regulatory networks (GRNs) feasible. To infer the overall structure of a GRN, there is a need to find out how the expression of each gene can be affected by the others. Many existing approaches to reconstructing GRNs are developed to generate hypotheses about the presence or absence of interactions between genes so that laboratory experiments can be performed afterwards for verification. Since they are not intended to be used to predict whether a gene in an unseen sample has any interactions with other genes, statistical verification of the reliability of the discovered interactions can be difficult. Furthermore, since the temporal ordering of the data is not taken into consideration, the directionality of regulation cannot be established using these existing techniques. To tackle these problems, we propose a data mining technique here. This technique makes use of a probabilistic inference approach to uncover interesting dependency relationships in noisy, high-dimensional time series expression data. It is not only able to determine whether a gene is dependent on another, but also whether it is activated or inhibited. In addition, it can predict how a gene would be affected by other genes even in unseen samples. For performance evaluation, the proposed technique has been tested with real expression data. Experimental results show that it can be very effective. The discovered dependency relationships can reveal gene regulatory relationships that could be used to infer the structures of GRNs.
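One simple way to see how temporal ordering yields both directionality and sign (activation versus inhibition) is a lagged-correlation score between a regulator's series and a target's series shifted by one time step. This heuristic is only a stand-in for the paper's probabilistic inference approach, included to make the idea concrete.

```python
def lagged_score(x, y, lag=1):
    """Pearson correlation between regulator series x at time t and target
    series y at time t+lag. A strongly positive score suggests activation,
    a strongly negative score suggests inhibition."""
    xs, ys = x[:-lag], y[lag:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs) ** 0.5
    vy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0
```

Because the target is shifted forward in time relative to the regulator, swapping the two series generally changes the score, which is what gives the relationship a direction.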

Journal ArticleDOI
TL;DR: The usefulness of eQTL to predict transcription factor binding sites is demonstrated with a real data set of a recombinant inbred line population of Arabidopsis thaliana.
Abstract: In this paper, we test whether it is possible to use genetical genomics information, such as expression quantitative trait loci (eQTL) mapping results, as input to a transcription factor binding site (TFBS) prediction algorithm. Furthermore, this new approach is compared to the more traditional cluster-based TFBS prediction. The results of eQTL mapping are used as input to one of the top-ranking TFBS prediction algorithms. Genes with observed expression profiles showing the same eQTL region are collected into eQTL groups. The promoter sequences of all the genes within the same eQTL group are used as input in the transcription factor binding site search. This approach is tested with a real data set of a recombinant inbred line population of Arabidopsis thaliana. The predicted motifs are compared to results obtained from the conventional approach of first clustering the gene expression values and then using the promoter sequences of the genes within the same cluster as input for the transcription factor binding site prediction. Our eQTL-based approach produced different motifs compared to the cluster-based method. Furthermore, the score of the eQTL-based motifs was higher than the score of the cluster-based motifs. In a comparison to already-predicted motifs from the AtcisDB database, the eQTL-based and cluster-based methods produced about the same number of hits with binding sites from AtcisDB. In conclusion, the results of this study clearly demonstrate the usefulness of eQTL to predict transcription factor binding sites.
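The eQTL-grouping step described above amounts to a straightforward grouping: genes whose expression maps to the same eQTL region are pooled, and their promoter sequences are handed to the motif finder together. A minimal sketch, with hypothetical region identifiers and input layout:

```python
def group_by_eqtl(eqtl_hits, promoters):
    """Group genes whose expression maps to the same eQTL region and pool
    their promoter sequences as input for a motif-finding tool.

    eqtl_hits -- dict gene -> eQTL region identifier (e.g. a chromosome bin)
    promoters -- dict gene -> promoter sequence
    """
    groups = {}
    for gene, region in eqtl_hits.items():
        groups.setdefault(region, []).append(gene)
    return {region: [promoters[g] for g in sorted(genes)]
            for region, genes in groups.items()}
```

Each resulting promoter list would then be submitted to the TFBS prediction algorithm, in place of the promoter lists produced by expression clustering in the conventional approach.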