
Showing papers in "Genome Informatics in 2002"


Journal ArticleDOI
TL;DR: An improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix is created, and a Python and a Perl interface to the C Clustering Library is generated, thereby combining the flexibility of a scripting language with the speed of C.
Abstract: SUMMARY: We have implemented k-means clustering, hierarchical clustering and self-organizing maps in a single multipurpose open-source library of C routines, callable from other C and C++ programs. Using this library, we have created an improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix. In addition, we generated a Python and a Perl interface to the C Clustering Library, thereby combining the flexibility of a scripting language with the speed of C. AVAILABILITY: The C Clustering Library and the corresponding Python C extension module Pycluster were released under the Python License, while the Perl module Algorithm::Cluster was released under the Artistic License. The GUI code Cluster 3.0 for Windows, Macintosh and Linux/Unix, as well as the corresponding command-line program, were released under the same license as the original Cluster code. The complete source code is available at http://bonsai.ims.u-tokyo.ac.jp/mdehoon/software/cluster. Alternatively, Algorithm::Cluster can be downloaded from CPAN, while Pycluster is also available as part of the Biopython distribution.
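
To give a concrete sense of how the library is called, here is a minimal sketch using the kcluster routine of the Pycluster/Bio.Cluster module distributed with Biopython; the toy expression matrix and parameter choices are invented for illustration.

```python
# Minimal sketch of calling the C Clustering Library through Biopython's
# Bio.Cluster wrapper (Pycluster). The expression matrix here is made up;
# kcluster performs k-means on the rows by default.
import numpy as np
from Bio import Cluster

# toy expression matrix: 6 "genes" x 4 "conditions"
data = np.array([
    [0.1, 0.2, 0.1, 0.3],
    [0.2, 0.1, 0.2, 0.2],
    [2.1, 2.3, 2.0, 2.2],
    [2.2, 2.1, 2.4, 2.0],
    [5.0, 5.1, 4.9, 5.2],
    [5.1, 5.0, 5.2, 4.8],
])

# k-means with k=3, 10 random restarts, Euclidean distance ('e')
clusterid, error, nfound = Cluster.kcluster(data, nclusters=3, npass=10, dist='e')
print(clusterid)   # cluster label per gene
print(error)       # within-cluster sum of distances of the best solution
```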

1,493 citations


Journal ArticleDOI
TL;DR: This work presents a comparative study on six feature selection heuristics by applying them to two sets of data, which are gene expression profiles from Acute Lymphoblastic Leukemia and proteomic patterns from ovarian cancer patients.
Abstract: Feature selection plays an important role in classification. We present a comparative study on six feature selection heuristics by applying them to two sets of data. The first set of data are gene expression profiles from Acute Lymphoblastic Leukemia (ALL) patients. The second set of data are proteomic patterns from ovarian cancer patients. Based on features chosen by these methods, error rates of several classification algorithms were obtained for analysis. Our results demonstrate the importance of feature selection in accurately classifying new samples.
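
As a rough illustration of what a filter-type feature selection heuristic looks like in this setting, the sketch below ranks features by a two-sample t-statistic on synthetic data; this particular heuristic is only a stand-in and is not necessarily one of the six compared in the paper.

```python
# Illustrative sketch of one common filter-type feature selection heuristic:
# rank features by a two-sample t-statistic between the two classes.
import numpy as np

def t_statistic_ranking(X, y, top_k=10):
    """Rank features (columns of X) by |t| between the two classes in y."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
    t = (m1 - m0) / np.sqrt(v0 / len(X0) + v1 / len(X1) + 1e-12)
    return np.argsort(-np.abs(t))[:top_k]

# toy data: 20 samples x 100 features, labels 0/1
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
y = np.array([0] * 10 + [1] * 10)
X[y == 1, :5] += 2.0          # make the first 5 features informative
print(t_statistic_ranking(X, y, top_k=5))
```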

455 citations


Journal ArticleDOI
TL;DR: This work developed a database for the potentially interacting domain pairs (PID) extracted from a dataset of experimentally identified interacting protein pairs with InterPro, an integrated database of protein families, domains and functional sites and provided a valuable tool for functional prediction of unknown proteins.
Abstract: Protein-protein interaction plays a critical role in biological processes. The identification of interacting proteins by computational methods can provide new leads in functional studies of uncharacterized proteins without performing extensive experiments. We developed a database of potentially interacting domain pairs (PID) extracted from a dataset of experimentally identified interacting protein pairs (DIP: Database of Interacting Proteins) with InterPro, an integrated database of protein families, domains and functional sites. In developing protein interaction databases and predictive methods, a sensitive statistical scoring system is critical to provide a reliability index for accurate functional analysis of interaction networks. We present a statistical scoring system, named the "PID matrix score", as a measure of the interaction probability (interactability) between domains. This system provides a valuable tool for functional prediction of unknown proteins. For the evaluation of the PID matrix, cross-validation was performed with subsets of DIP data. The prediction system gives about 50% sensitivity and more than 98% specificity, which implies that the information for interacting protein pairs could be enriched about 30-fold with the PID matrix. It is demonstrated that mapping of the genome-wide interaction network can be achieved by using the PID matrix.
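
The sketch below illustrates the general principle behind scoring domain pairs from observed protein interactions with a log-odds-style ratio of observed to naively expected co-occurrence; the exact PID matrix score is defined in the paper, and the proteins, domain identifiers and interactions here are invented.

```python
# Hedged sketch of the idea behind a domain-pair interaction score: count how
# often each domain pair co-occurs across observed protein interactions and
# compare against a naive expectation from overall domain frequencies.
# This log-odds-style ratio is only an illustration, not the PID matrix score.
import math
from collections import Counter
from itertools import product

# toy annotations: protein -> set of InterPro-like domain ids (invented)
domains = {
    "P1": {"IPR001"}, "P2": {"IPR002"},
    "P3": {"IPR001", "IPR003"}, "P4": {"IPR002"},
}
# toy experimentally observed interactions (DIP-like pairs)
interactions = [("P1", "P2"), ("P3", "P4")]

pair_counts = Counter()
for a, b in interactions:
    for da, db in product(domains[a], domains[b]):
        pair_counts[tuple(sorted((da, db)))] += 1

dom_freq = Counter(d for ds in domains.values() for d in ds)
n_prot, n_int = len(domains), len(interactions)

for (da, db), c in pair_counts.items():
    observed = c / n_int
    expected = (dom_freq[da] / n_prot) * (dom_freq[db] / n_prot)
    print(da, db, round(math.log((observed + 1e-9) / (expected + 1e-9)), 2))
```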

124 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present work from the Laboratory for Applied Biological Regulation Technology at the Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, together with the University of Tsukuba and Mitsui Knowledge Industry Co., Ltd.
Abstract: 1 Laboratory for Applied Biological Regulation Technology, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan 2 Institute of Basic Medical Sciences, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8575, Japan 3 Bioscience Division, Mitsui Knowledge Industry Co., Ltd., Harmony Tower 21st Floor, 1-32-2 Honcho, Nakano-ku, Tokyo 164-8721, Japan

91 citations


Journal ArticleDOI
TL;DR: A method that automatically generates classifications of gene-product functions using bibliographic information is proposed and the comparison with the well accepted GO ontology points to different situations in which the automatically derived classification can be useful for assisting human experts in the annotation of ontologies.
Abstract: Detailed classifications, controlled vocabularies and organised terminology are widely used in different areas of science and technology. Their relatively recent introduction in molecular biology has been crucial for progress in the analysis of genomics and massive proteomics experiments. Unfortunately, the construction of the ontologies, including terminology, classification and entity relations, requires considerable effort, including the analysis of massive amounts of literature. We propose here a method that automatically generates classifications of gene-product functions using bibliographic information. The corresponding classification structures mirror the ones constructed by human experts. The analysis of a large structure built for yeast gene-products, and the detailed inspection of various examples, show encouraging properties. In particular, the comparison with the well accepted GO ontology points to different situations in which the automatically derived classification can be useful for assisting human experts in the annotation of ontologies.

66 citations


Journal ArticleDOI
TL;DR: It is shown that feature generation together with correlation based feature selection can be used with a variety of machine learning algorithms to give highly accurate translation initiation site prediction and the results achieve comparable accuracy to the best existing approaches.
Abstract: Correct prediction of the translation initiation site (TIS) is an important issue in genomic research. We show that feature generation together with correlation-based feature selection can be used with a variety of machine learning algorithms to give highly accurate translation initiation site prediction. Only very few features are needed and the results achieve comparable accuracy to the best existing approaches. Our approach has the advantage that it does not require one to devise a special prediction method; rather, standard machine learning classifiers are shown to give very good performance on the selected features. The raw and generated features which we have found to be important are the following: positions 3 and 1 in the sequence; upstream k-grams for k=3, 4, and 5; stop-codon frequency; downstream in-frame 3-gram; and the distance of ATG to the beginning of the sequence. The best result, with an overall accuracy of 90%, is obtained by selecting only seven features from this set. The same features retrained with the use of a scanning model achieve an overall accuracy of 94% on this dataset.
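
The following sketch shows the flavor of the feature generation step for a candidate ATG (upstream k-grams, downstream in-frame 3-grams, stop-codon frequency, distance to the sequence start); the window sizes and feature names are illustrative choices rather than the paper's exact definitions.

```python
# Sketch of feature generation around a candidate translation initiation ATG.
from collections import Counter

STOPS = {"TAA", "TAG", "TGA"}

def tis_features(seq, atg_pos, up_window=99, down_window=99):
    up = seq[max(0, atg_pos - up_window):atg_pos]
    down = seq[atg_pos + 3:atg_pos + 3 + down_window]
    feats = {"dist_to_start": atg_pos}
    # upstream k-grams for k = 3, 4, 5
    for k in (3, 4, 5):
        grams = Counter(up[i:i + k] for i in range(len(up) - k + 1))
        for g, c in grams.items():
            feats[f"up_{k}gram_{g}"] = c
    # downstream in-frame 3-grams and stop-codon frequency
    codons = [down[i:i + 3] for i in range(0, len(down) - 2, 3)]
    feats["down_stop_freq"] = sum(c in STOPS for c in codons) / max(len(codons), 1)
    for c in codons:
        feats[f"down_inframe_3gram_{c}"] = feats.get(f"down_inframe_3gram_{c}", 0) + 1
    return feats

seq = "GCCGCCACCATGGCTGAATTCTGA" * 3     # toy sequence
print(list(tis_features(seq, seq.find("ATG")).items())[:5])
```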

64 citations


Journal ArticleDOI
TL;DR: In this study, an in silico metabolic pathway network of Escherichia coli consisting of 301 reactions and 294 metabolites is constructed, and it was found that the pyruvate carboxylation pathway should be used rather than the phosphoenolpyruvate carboxylation pathway for optimal succinic acid production in E. coli.
Abstract: The intracellular metabolic fluxes can be calculated by metabolic flux analysis, which uses a stoichiometric model for the intracellular reactions along with mass balances around the intracellular metabolites. In this study, we have constructed an in silico metabolic pathway network of Escherichia coli consisting of 301 reactions and 294 metabolites. Metabolic flux analyses were carried out to estimate flux distributions that achieve the maximum in silico yield of succinic acid in E. coli. The maximum in silico yield of succinic acid was only 83% of its theoretical yield. The lower in silico yield of succinic acid was found to be due to insufficient reducing power, and it could be raised to the theoretical yield by supplying more reducing power. Furthermore, the optimal metabolic pathways for the production of succinic acid could be proposed based on the results of the metabolic flux analyses. In the case of succinic acid production, it was found that the pyruvate carboxylation pathway should be used rather than the phosphoenolpyruvate carboxylation pathway for its optimal production in E. coli. Then, the in silico optimal succinic acid pathway was compared with the conventional succinic acid pathway through a minimum set of wet experiments. The results of the wet experiments indicate that the pathway predicted by the in silico analysis is more efficient than the conventional pathway.
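
A hedged sketch of the underlying flux-balance calculation: maximize a target flux subject to steady-state mass balances S·v = 0 and flux bounds, solved as a linear program. The three-reaction toy network is invented and unrelated to the 301-reaction E. coli model used in the paper.

```python
# Conceptual sketch of flux-balance analysis as a linear program.
import numpy as np
from scipy.optimize import linprog

# stoichiometric matrix S (metabolites x reactions)
# columns: v0 = uptake -> A, v1 = A -> B, v2 = B -> exported product
S = np.array([
    [1, -1,  0],   # metabolite A
    [0,  1, -1],   # metabolite B
])
c = np.array([0.0, 0.0, -1.0])        # maximize v2 (linprog minimizes)
bounds = [(0, 10), (0, None), (0, None)]

res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds, method="highs")
print("optimal product flux:", res.x[2])
```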

54 citations


Journal ArticleDOI
TL;DR: Here it is shown that the extremely fast "resampling of estimated log likelihoods" or RELL method behaves well under more general circumstances than previously examined and approximates the bootstrap (BP) proportions of trees better than some bootstrap methods that rely on fast heuristics to search the tree space.
Abstract: Evolutionary trees sit at the core of all realistic models describing a set of related sequences, including alignment, homology search, ancestral protein reconstruction and 2D/3D structural change. It is important to assess the stochastic error when estimating a tree, including models using the most realistic likelihood-based optimizations, yet computation times may be many days or weeks. If so, the bootstrap is computationally prohibitive. Here we show that the extremely fast "resampling of estimated log likelihoods" or RELL method behaves well under more general circumstances than previously examined. RELL approximates the bootstrap (BP) proportions of trees better than some bootstrap methods that rely on fast heuristics to search the tree space. The BIC approximation of the Bayesian posterior probability (BPP) of trees is made more accurate by including an additional term related to the determinant of the information matrix (which may also be obtained as a product of gradient or score vectors). Such estimates are shown to be very close to MCMC chain values. Our analysis of mammalian mitochondrial amino acid sequences suggests that when model breakdown occurs, as it typically does for sequences separated by more than a few million years, the BPP values are far too peaked and the real fluctuations in the likelihood of the data are many times larger than expected. Accordingly, several ways to incorporate the bootstrap and other types of direct resampling with MCMC procedures are outlined. Genes evolve by a process in which some sites follow a tree close to, but not identical with, the species tree. It is seen that under such a likelihood model BP (bootstrap proportion) and BPP estimates may still be reasonable estimates of the species tree. Since many of the methods studied are very fast computationally, there is no reason to ignore stochastic error even with the slowest ML or likelihood-based methods.
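
A minimal sketch of the RELL idea, assuming the per-site log-likelihoods for each candidate tree have already been computed by an ML program: bootstrap-resample the sites and record which tree wins each replicate, with no re-optimization. The matrix below is random.

```python
# RELL: resample per-site log-likelihoods instead of redoing ML inference.
import numpy as np

rng = np.random.default_rng(1)
n_sites, n_trees = 500, 3
site_lnL = rng.normal(loc=-2.0, scale=0.3, size=(n_sites, n_trees))
site_lnL[:, 0] += 0.02          # make tree 0 slightly better on average

def rell_proportions(site_lnL, n_boot=1000, rng=rng):
    n_sites, n_trees = site_lnL.shape
    wins = np.zeros(n_trees)
    for _ in range(n_boot):
        idx = rng.integers(0, n_sites, size=n_sites)   # resample sites
        totals = site_lnL[idx].sum(axis=0)
        wins[np.argmax(totals)] += 1
    return wins / n_boot

print(rell_proportions(site_lnL))   # approximate bootstrap proportions per tree
```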

46 citations


Journal ArticleDOI
TL;DR: This work introduces a new cost model in which the lengths of the reversed sequences play a role, allowing more flexibility in accounting for mutation phenomena, and proposes an efficient, novel algorithm that takes length into account as an optimization criterion.
Abstract: Current algorithmic studies of genome rearrangement ignore the length of reversals (or inversions); rather, they only count their number. We introduce a new cost model in which the lengths of the reversed sequences play a role, allowing more flexibility in accounting for mutation phenomena. Our focus is on sorting unsigned (unoriented) permutations by reversals; since this problem remains difficult (NP-hard) in our new model, the best we can hope for are approximation results. We propose an efficient, novel algorithm that takes (a monotonic function f of) length into account as an optimization criterion and study its properties. Our results include an upper bound of O(f(n) lg^2 n) for any additive cost measure f on the cost of sorting any n-element permutation, and a guaranteed approximation ratio of O(lg^2 n) times optimal for sorting a given permutation. Our work poses some interesting questions to both biologists and computer scientists and suggests some new bioinformatic insights that are currently being studied.
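
A tiny sketch of the cost model, assuming an additive cost measure f: each reversal of a segment of length L is charged f(L) rather than 1, so the same sorting schedule can be evaluated under both the classical count and the length-weighted cost.

```python
# Contrast the count-based reversal cost with a length-sensitive cost f(L).
def apply_reversals(perm, reversals, f=lambda length: length):
    perm = list(perm)
    count_cost, length_cost = 0, 0
    for i, j in reversals:              # reverse the segment perm[i:j+1]
        perm[i:j + 1] = reversed(perm[i:j + 1])
        count_cost += 1
        length_cost += f(j - i + 1)
    return perm, count_cost, length_cost

perm = [3, 1, 2, 4]
schedule = [(0, 2), (0, 1)]             # two reversals that sort the permutation
print(apply_reversals(perm, schedule))  # sorted permutation, count cost, length cost
```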

44 citations


Journal ArticleDOI
TL;DR: An unsupervised neural network algorithm, Kohonen's self-organizing map (SOM), is used to analyze di- and trinucleotide frequencies in 9 eukaryotic genomes of known sequences to recognize species-specific characteristics that are signature representations of each genome.
Abstract: With the increasing amount of available genome sequences, novel tools are needed for comprehensive analysis of species-specific sequence characteristics for a wide variety of genomes. We used an unsupervised neural network algorithm, Kohonen's self-organizing map (SOM), to analyze di- and trinucleotide frequencies in 9 eukaryotic genomes of known sequences (a total of 1.2 Gb): S. cerevisiae, S. pombe, C. elegans, A. thaliana, D. melanogaster, Fugu, and rice, as well as P. falciparum chromosomes 2 and 3, and human chromosomes 14, 20, 21, and 22, that have been almost completely sequenced. Each genomic sequence with different window sizes was encoded as a 16- and 64-dimensional vector giving relative frequencies of di- and trinucleotides, respectively. From analysis of a total of 120,000 nonoverlapping 10-kb sequences and overlapping 100-kb sequences with a moving step size of 10 kb, derived from the 1.2 Gb of genomic sequences, clear species-specific separations of most sequences were obtained with the SOMs. The unsupervised algorithm could recognize, in most of the 120,000 10-kb sequences, the species-specific characteristics (key combinations of oligonucleotide frequencies) that are signature representations of each genome. Because the classification power is very high, the SOMs can provide fundamental bioinformatic strategies for extracting a wide range of genomic information that could not otherwise be obtained.
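
A sketch of the encoding step, assuming 10-kb non-overlapping windows: each window becomes a 16-dimensional vector of relative dinucleotide frequencies (the 64-dimensional trinucleotide case is analogous); training of the SOM itself is omitted here.

```python
# Encode genomic windows as 16-dimensional dinucleotide frequency vectors,
# the input representation that would be fed to a self-organizing map.
from itertools import product
import numpy as np

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]   # 16 dinucleotides
INDEX = {d: i for i, d in enumerate(DINUCS)}

def dinuc_vector(window):
    counts = np.zeros(len(DINUCS))
    for i in range(len(window) - 1):
        d = window[i:i + 2]
        if d in INDEX:                      # skip positions containing N etc.
            counts[INDEX[d]] += 1
    return counts / max(counts.sum(), 1)

def windows(seq, size=10_000):
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]

genome = "ACGT" * 30_000                   # placeholder sequence
vectors = np.array([dinuc_vector(w) for w in windows(genome)])
print(vectors.shape)                       # (n_windows, 16)
```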

39 citations



Journal ArticleDOI
TL;DR: In this article, maximum clique-based algorithms were proposed for spot matching in two-dimensional gel electrophoresis images, protein structure alignment, and protein side-chain packing.
Abstract: We developed maximum clique-based algorithms for spot matching for two-dimensional gel electrophoresis images, protein structure alignment and protein side-chain packing, where these problems are known to be NP-hard. Algorithms based on direct reductions to the maximum clique can find optimal solutions for instances of size (the number of points or residues) up to 50-150 using a standard PC. We also developed pre-processing techniques to reduce the sizes of graphs. Combined with some heuristics, many realistic instances can be solved approximately.

Journal ArticleDOI
TL;DR: Experimental results of benchmarks from the BAliBASE show that the proposed method is superior to MSA, OMA, and SAGA methods with regard to quality of solution and running time.
Abstract: This paper presents a parallel hybrid genetic algorithm (GA) for solving the sum-of-pairs multiple protein sequence alignment. A new chromosome representation and its corresponding genetic operators are proposed. A multi-population GENITOR-type GA is combined with local search heuristics. It is then extended to run in parallel on a multiprocessor system for speeding up. Experimental results of benchmarks from the BAliBASE show that the proposed method is superior to MSA, OMA, and SAGA methods with regard to quality of solution and running time. It can be used for finding multiple sequence alignment as well as testing cost functions.

Journal ArticleDOI
TL;DR: Novel kernels that measure similarity of two RNA sequences, taking account of their secondary structures are presented, including the marginalized count kernel (MCK), which employs stochastic context-free grammar for estimating the secondary structure.
Abstract: We present novel kernels that measure similarity of two RNA sequences, taking account of their secondary structures. Two types of kernels are presented. One is for RNA sequences with known secondary structures, the other for those without known secondary structures. The latter employs stochastic context-free grammar (SCFG) for estimating the secondary structure. We call the latter the marginalized count kernel (MCK). We show computational experiments for MCK using 74 sets of human tRNA sequence data: (i) kernel principal component analysis (PCA) for visualizing tRNA similarities, (ii) supervised classification with support vector machines (SVMs). Both types of experiment show promising results for MCKs.

Journal ArticleDOI
TL;DR: The method for measuring the reliability of the estimated gene network by using the bootstrap method is proposed, which shows good results in both the accuracy and the efficiency of the estimation.
Abstract: The development of the microarray technology provides us with a huge amount of gene expression profiles. The estimation of a gene network has received considerable attention in the field of bioinformatics, and several methodologies have been proposed, such as the Boolean network [1], the Bayesian network [3, 4, 5] and so on. In this paper, we propose a method for measuring the reliability of the estimated gene network by using the bootstrap method [2].
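
A generic sketch of the bootstrap procedure for edge reliability: resample the microarray samples with replacement, re-estimate a network on each replicate, and report the frequency of each edge. A thresholded correlation network stands in here for the Bayesian or Boolean network estimators cited in the abstract.

```python
# Bootstrap edge frequencies as a reliability measure for an estimated network.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 40, 5
X = rng.normal(size=(n_samples, n_genes))
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=n_samples)    # gene 1 tracks gene 0

def estimate_edges(data, threshold=0.6):
    corr = np.corrcoef(data, rowvar=False)
    g = data.shape[1]
    return {(i, j) for i in range(g) for j in range(i + 1, g)
            if abs(corr[i, j]) > threshold}

def edge_bootstrap(X, n_boot=200):
    counts = {}
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))       # resample samples
        for e in estimate_edges(X[idx]):
            counts[e] = counts.get(e, 0) + 1
    return {e: c / n_boot for e, c in counts.items()}

print(edge_bootstrap(X))    # bootstrap probability per edge
```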

Journal ArticleDOI
Naoki Sato1
TL;DR: Results suggest that 238 groups that are common to all organisms analyzed may define a minimal set of gene groups, and only 80 groups are identified as the gene groups that could not have been acquired by plants without cyanobacterial endosymbiosis.
Abstract: The chloroplast genome originates from the genome of an ancestral cyanobacterial endosymbiont. The comparison of the genomes of cyanobacteria and plants has been made possible by advances in genome sequencing. I report here current results of our computational efforts to compare the genomes of cyanobacteria and plants and to trace the process of evolution of cyanobacteria, chloroplasts and plants. Cyanobacteria form a clearly defined monophyletic clade with a reasonable level of diversity and are ideal for testing various approaches to genome comparison. Analysis of short sequence features such as genome signature was found to be useful in characterizing cyanobacterial genomes. Comparison of genome contents was performed by homology grouping of predicted protein coding sequences, rather than orthologue-based comparison, to minimize the effects of multi-domain proteins and large protein families, both of which are important in cyanobacterial genomes. Comparison of the genomes of six species of cyanobacteria suggests that there are a number of species-specific additions of protein genes, and this information is useful in reconstructing phylogenetic relationships. The homology groups in cyanobacteria were used as a reference to compare plants and non-photosynthetic organisms. The results suggest that 238 groups that are common to all organisms analyzed may define a minimal set of gene groups. In addition, only 80 groups are identified as gene groups that could not have been acquired by plants without cyanobacterial endosymbiosis. Further study is needed to identify plant genes of cyanobacterial origin.

Journal ArticleDOI
TL;DR: A new approach to pattern discovery called string pattern regression is presented, where a data set is given that consists of a string attribute and an objective numerical attribute, and an exact but efficient branch-and-bound algorithm is presented which is applicable to various pattern classes.
Abstract: We present a new approach to pattern discovery called string pattern regression, where we are given a data set that consists of a string attribute and an objective numerical attribute. The problem is to find the best string pattern that divides the data set in such a way that the distribution of the numerical attribute values of the set for which the pattern matches the string attribute is most distinct, with respect to some appropriate measure, from the distribution of the numerical attribute values of the set for which the pattern does not match the string attribute. By solving this problem, we are able to discover, at the same time, a subset of the data whose objective numerical attributes are significantly different from the rest of the data, as well as the splitting rule in the form of a string pattern that is conserved in the subset. Although the problem can be solved in linear time for the substring pattern class, the problem is NP-hard in the general case (i.e. more complex patterns), and we present an exact but efficient branch-and-bound algorithm which is applicable to various pattern classes. We apply our algorithm to intron sequences of human, mouse, fly, and zebrafish, and show the practicality of our approach and algorithm. We also discuss possible extensions of our algorithm, as well as promising applications, such as microarray gene expression data.
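
A brute-force sketch of string pattern regression restricted to the substring pattern class, using the absolute difference of means as the distinctness measure; the paper's branch-and-bound algorithm and richer pattern classes are not reproduced here.

```python
# For every candidate substring, split records by whether the pattern occurs
# in the string attribute and score how distinct the two numeric groups are.
def best_substring_pattern(records, max_len=4):
    candidates = {s[i:i + k]
                  for s, _ in records
                  for k in range(1, max_len + 1)
                  for i in range(len(s) - k + 1)}
    best = (None, -1.0)
    for pat in candidates:
        inside = [v for s, v in records if pat in s]
        outside = [v for s, v in records if pat not in s]
        if not inside or not outside:
            continue
        score = abs(sum(inside) / len(inside) - sum(outside) / len(outside))
        if score > best[1]:
            best = (pat, score)
    return best

# toy data: (intron-like string, numeric attribute such as an expression level)
records = [("GTAAGT", 5.0), ("GTGAGT", 4.8), ("CCTTAG", 1.1), ("CCATAG", 0.9)]
print(best_substring_pattern(records))
```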

Journal ArticleDOI
TL;DR: A dynamic Bayesian network and nonparametric regression model for estimating a gene network with cyclic regulations from time series microarray data is proposed and a criterion for selecting a network from Bayes approach is derived.
Abstract: A Bayesian network is a powerful tool for modeling relations among a large number of random variables. Therefore the Bayesian network has received considerable attention in studies of gene network estimation using microarray gene expression data. Imoto et al. [1, 2] proposed a Bayesian network and nonparametric regression model for capturing nonlinear relations between genes from continuous gene expression data. However, a Bayesian network still has the problem that it cannot construct cyclic regulations, while real gene networks have cyclic regulations. As a solution to this problem, in this paper, we propose a dynamic Bayesian network and nonparametric regression model for estimating a gene network with cyclic regulations from time series microarray data. We also derive a criterion for selecting a network from a Bayes approach. The effectiveness of our method is demonstrated through the analysis of Saccharomyces cerevisiae gene expression data.

Journal ArticleDOI
TL;DR: In this paper, the authors attempt to infer the interactions in larger-scale gene expression networks and propose new, efficient approaches to narrowing down the candidate parameter sets that explain the observed time courses within the immense search space of parameter values.
Abstract: Recent advances in powerful new technologies such as DNA microarrays provide a mass of gene expression data on a genomic scale. One of the most important projects of the post-genome era is the system identification of gene networks using these observed data. We previously introduced an efficient numerical optimization technique using time-course data of system components, based on real-coded genetic algorithms (RCGAs), to estimate the reaction coefficients among system components of a dynamic network model called the S-system [3], which is a type of power-law formalism and is suitable for describing organizationally complex systems such as gene expression networks and metabolic pathways. This technique, combining one of the crossover operators for RCGAs, unimodal normal distribution crossover (UNDX) [1], with the generation-alternation model called minimal generation gap (MGG) [2], showed remarkable superiority to the simple GA in the case of a simple oscillatory system [4]. However, that case study was a comparatively easy inverse problem: the number of system components was 2 and the number of estimated parameters was 12. For application to gene networks involving a huge number of estimated parameters, our optimization techniques have to be adapted to inverse problems under stricter conditions. In this paper, we attempt to infer the interactions in larger-scale gene expression networks. In this case study, we also propose new, efficient approaches to narrow down the candidates that explain the observed time courses within the immense search space of parameter values.
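
A sketch of the S-system form referred to above, dX_i/dt = α_i ∏_j X_j^{g_ij} − β_i ∏_j X_j^{h_ij}, with invented two-component parameters; a GA-based parameter search would repeatedly run such a forward simulation and score it against the observed time courses.

```python
# Forward simulation of a tiny S-system model with made-up parameters.
import numpy as np

alpha = np.array([3.0, 2.0])
beta  = np.array([2.0, 3.0])
g = np.array([[0.0, -0.8],      # g[i][j]: kinetic order of X_j in production of X_i
              [0.5,  0.0]])
h = np.array([[0.5,  0.0],      # h[i][j]: kinetic order of X_j in degradation of X_i
              [0.0,  0.5]])

def s_system_rhs(x):
    production  = alpha * np.prod(x ** g, axis=1)
    degradation = beta  * np.prod(x ** h, axis=1)
    return production - degradation

def simulate(x0, dt=0.01, steps=1000):
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x + dt * s_system_rhs(x)     # simple Euler integration
        traj.append(x.copy())
    return np.array(traj)

print(simulate([1.0, 2.0])[-1])          # state after 10 time units
```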

Journal ArticleDOI
TL;DR: This paper presents a discrete mathematical description of the ion transport across cell membranes in terms of the π-calculus process algebra and motivates the use of the π-calculus as an adequate formalism for molecular processes by describing the dynamics of the Na pump.
Abstract: Integration of biological data and modelling and simulation of biological systems have become important research topics. Biology should adopt theoretical frameworks from physics, mathematics and computer science to address the enormous number of interacting molecules. Cell behaviour and molecular processes are usually described in biology by partial differential equations. These equations often fail to express molecular interactions or to represent systems with a small number of molecules. We propose a discrete mathematical tool called the π-calculus to model interactions and subsequent state transitions. The model provides a computational framework that allows an automated verification of system properties. This paper presents a discrete mathematical description of the ion transport across cell membranes in terms of the π-calculus process algebra. We motivate the use of the π-calculus as an adequate formalism for molecular processes by describing the dynamics of the Na pump. The Albers-Post mechanism is translated into an elegant π-calculus model outlining molecular interactions, conformational transformations, and ion transport during the pumping process. We use a sophisticated software tool to verify some properties of the described system.

Journal ArticleDOI
TL;DR: A workbench called SequeX is introduced for the analysis and visualization of whole genome sequences using SSB-tree (Static SB-tree) and can be used to identify conserved genes or sequences by the analysis of the common k-mers and annotation.
Abstract: As sequenced genomes become larger and the sequencing process becomes faster, there is a need for tools that analyze sequences on the whole-genome scale. However, in-memory algorithms such as the suffix tree and suffix array are not applicable to the analysis of whole genome sequence sets, since the size of an individual whole genome ranges from several million base pairs to hundreds of billions of base pairs. In order to effectively manipulate the huge sequence data, it is necessary to use an indexed data structure for external memory. In this paper, we introduce a workbench called SequeX for the analysis and visualization of whole genome sequences using the SSB-tree (Static SB-tree). It consists of two parts: the analysis query subsystem and the visualization subsystem. The query subsystem supports various transactions such as pattern matching, k-occurrence, and k-mer analysis. The visualization subsystem helps biologists to easily understand whole genome structure and features through a sequence viewer, annotation viewer, CGR (Chaos Game Representation) viewer, and k-mer viewer. The system also supports a user-friendly programming interface based on JavaScript for batch processing and for extension to a user's specific purpose. SequeX can be used to identify conserved genes or sequences by the analysis of common k-mers and annotation. We analyzed the common k-mers for 72 microbial genomes announced by Entrez, and found an interesting biological fact: the longest common k-mer for the 72 sequences is an 11-mer, and only 11 such sequences exist. Finally we note that many common k-mers occur in conserved regions such as CDS, rRNA, and tRNA.
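
A sketch of the k-mer analysis behind the reported common-k-mer observation: collect the k-mer sets of several sequences and intersect them. The toy strings below stand in for whole microbial genomes, and the external-memory SSB-tree machinery is not modeled.

```python
# Find k-mers shared by all input sequences for a range of k.
def kmer_set(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def common_kmers(sequences, k):
    sets = [kmer_set(s, k) for s in sequences]
    common = sets[0]
    for s in sets[1:]:
        common &= s
    return common

genomes = ["ACGTACGTGGCA", "TTACGTACGTAA", "GGACGTACGTCC"]   # placeholders
for k in range(4, 9):
    shared = common_kmers(genomes, k)
    print(k, len(shared), sorted(shared)[:3])
```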

Journal ArticleDOI
TL;DR: It is shown that a perceptron can be trained to use the deformation propensity at each step in a sequence to generate such weights, and that applying non-uniform weights to the contribution of each base step to aggregate deformation inclination can greatly improve classification accuracy.
Abstract: We examine the use of deformation propensity at individual base steps for the identification of DNA-protein binding sites. We have previously demonstrated that estimates of the total energy required to bend DNA to its bound conformation can partially explain indirect DNA-protein interactions. We now show that the deformation propensities at each base step are not equally informative for classifying a sequence as a binding site, and that applying non-uniform weights to the contribution of each base step to the aggregate deformation propensity can greatly improve classification accuracy. We show that a perceptron can be trained to use the deformation propensity at each step in a sequence to generate such weights.
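
A sketch of the weighting scheme described above, assuming synthetic per-base-step deformation propensities and labels: a simple perceptron is trained on the step values, and the learned weights indicate which steps are most informative.

```python
# Train a perceptron whose inputs are per-base-step deformation propensities.
import numpy as np

rng = np.random.default_rng(0)
n_steps = 15                                   # base steps per candidate site
X = rng.normal(size=(200, n_steps))            # stand-in deformation propensities
true_w = np.zeros(n_steps)
true_w[4:8] = 1.0                              # only a few steps matter
y = np.where(X @ true_w > 0, 1, -1)            # synthetic binding / non-binding labels

def train_perceptron(X, y, epochs=20, lr=0.1):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:         # misclassified: update
                w += lr * yi * xi
                b += lr * yi
    return w, b

w, b = train_perceptron(X, y)
pred = np.where(X @ w + b > 0, 1, -1)
print("training accuracy:", (pred == y).mean())
print("largest-weight steps:", np.argsort(-np.abs(w))[:4])
```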


Journal ArticleDOI
TL;DR: A virtual private network (VPN) is adopted to protect the Grid system from the outer world, and several design goals are set for the bioinformatics environment.
Abstract: Recently, bioinformatics has come to require high-performance computing facilities for homology search, molecular simulation, cell simulation, etc. Grid computing [1] has the potential to expand computing performance by connecting a large number of computers or PC clusters with high-performance networks. Along this line, we have designed and developed the Open Bioinformatics Grid (OBIGrid [2], http://www.obigrid.org/ ) in cooperation with the Japan Committee on High-Performance Computing for Bioinformatics and the Initiative for Parallel Bioinformatics Processing (IPAB). In designing OBIGrid, we emphasize the importance of network transparency and security policy issues rather than performance at this early development stage. Therefore, we adopt a virtual private network (VPN) to protect the Grid system from/to the outer world. As for the bioinformatics environment, we set the following design goals to provide:

Journal ArticleDOI
TL;DR: A strategy is proposed here based on a different concept of sequence homology, derived from a periodicity analysis of the physicochemical properties of the residues constituting proteins' primary structures, that outperforms other methodologies in many cases in the twilight zone.
Abstract: Divergence in sequence through evolution precludes sequence-alignment-based homology methodologies for protein folding prediction from detecting structural and folding similarities for distantly related proteins. Homolog coverage of actual databases is also a factor playing a critical role in the performance of those methodologies, a factor that is conspicuously apparent in what is called the twilight zone of sequence homology, in which proteins with a high degree of similarity in both biological function and structure are found but for which the amino acid sequence homology ranges from about 20% to less than 30%. In contrast to these methodologies, a strategy is proposed here based on a different concept of sequence homology. This concept is derived from a periodicity analysis of the physicochemical properties of the residues constituting proteins' primary structures. The analysis is performed using a front-end processing technique from automatic speech recognition, by means of which the cepstrum (a measure of the periodic wiggliness of a frequency response) is computed, leading to a spectral envelope that depicts the subtle periodicity in the physicochemical characteristics of the sequence. Homology in sequences is then derived by alignment of spectral envelopes. Proteins sharing common folding patterns and biological function but low sequence homology can then be detected by their similarity in the spectral dimension. The methodology, applied to protein folding recognition, outperforms other methodologies in many cases in the twilight zone.
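
A sketch of the cepstrum computation on a protein sequence, assuming a Kyte-Doolittle hydropathy profile as the physicochemical signal (one plausible choice; the paper may use other properties): the real cepstrum is the inverse FFT of the log power spectrum of the profile. The example sequence is arbitrary.

```python
# Map a protein sequence to a hydropathy signal and compute its real cepstrum.
import numpy as np

KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def cepstrum(sequence):
    signal = np.array([KD[a] for a in sequence if a in KD], dtype=float)
    signal -= signal.mean()                        # remove the DC component
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    return np.fft.irfft(np.log(spectrum + 1e-12))  # real cepstrum

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
print(cepstrum(seq)[:8])                           # low-quefrency coefficients
```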

Journal ArticleDOI
TL;DR: In this article, the authors developed a reliable prediction system of protein-protein interaction sites from their three-dimensional structure, which can predict the interaction sites on the protein surface from protein interaction data.
Abstract: Protein-protein interactions play an important role in various biological processes. Over the past few years, several studies have been made on protein interfaces, and those results enable us to obtain massive data on various aspects of protein interfaces. The problem that we have to consider next is predicting the interaction sites on the protein surface. Although several prediction methods have been developed, there is still room for improvement. The purpose of this study is to develop a reliable prediction system for protein-protein interaction sites from the proteins' three-dimensional structure.

Journal ArticleDOI
TL;DR: It is found that the G and C spectral curves have a flat region in the middle frequency range from f = 10^-4 to 10^-5 (corresponding to cyclic sizes of 1 kb-5 kb), which may be associated with randomness of base sequence composition.
Abstract: In the present study, we identified periodic patterns in nucleotide sequences and characterized the nucleotide sequences that confer periodicities on Arabidopsis thaliana and Drosophila melanogaster on the basis of a power spectrum method and the frequency of nucleotide sequences. To assign regions that contribute to each periodicity, we calculated periodic nucleotide distributions using a parameter proposed in the paper. In A. thaliana, we obtained three periodicities (248 bp, 167 bp, and 126 bp) in chromosome 3, three peaks (174 bp, 88 bp, and 59 bp periods) in chromosome 4, and four periodicities (356 bp, 174 bp, 88 bp, and 59 bp) in chromosome 5. These are related to ORFs that consist of Gly-rich amino acid sequences, including histone proteins that consist of Gly-, Ser-, and Ala-rich amino acid residues. For the D. melanogaster genome we found that the G and C spectral curves have a flat region in the middle frequency range from f = 10^-4 to 10^-5 (corresponding to cyclic sizes of 1 kb-5 kb), which may be associated with randomness of base sequence composition. This property has not yet been observed in Saccharomyces cerevisiae, Caenorhabditis elegans, or Homo sapiens.

Journal ArticleDOI
TL;DR: The results of comparison of algorithms for remote homology detection using the SCOP database are shown and a new SVM based method (SVM-SW) is proposed, which uses the Smith-Waterman (SW) dynamic programming algorithm as a kernel function.
Abstract: Remote homology detection for protein sequences is one of the important and well-studied problems in bioinformatics. Many algorithms have been developed for this purpose. The Smith-Waterman (SW) dynamic programming algorithm was developed in the early 1980s [8], and is still widely used today. In the 1990s, many methods were developed based on profiles [1] and hidden Markov models [2, 4]. In the 2000s, methods using SVMs (support vector machines) were developed, such as the SVM-Fisher method [3]. Recently, Liao and Noble proposed the SVM-pairwise method [5], which uses a vector of pairwise similarities with all proteins in the training set. Quite recently, we proposed a new SVM-based method (SVM-SW), which uses the SW algorithm as a kernel function [7]. Though we have not yet succeeded in proving that the SW score is always a valid kernel, SVM-SW worked successfully in all cases we tested. In this poster abstract, we briefly show the results of a comparison of algorithms for remote homology detection using the SCOP database [6].
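
A sketch of the mechanism behind SVM-SW, using scikit-learn's precomputed-kernel interface: an all-vs-all similarity matrix (here a random positive semi-definite stand-in for Smith-Waterman scores) is passed directly to the SVM. Whether raw SW scores always form a valid kernel is, as the abstract notes, an open question.

```python
# Use a precomputed similarity matrix as an SVM kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 30
features = rng.normal(size=(n, 5))
K = features @ features.T                 # stand-in for an all-vs-all SW score matrix
y = (features[:, 0] > 0).astype(int)      # synthetic labels

train, test = np.arange(20), np.arange(20, n)
clf = SVC(kernel="precomputed").fit(K[np.ix_(train, train)], y[train])
pred = clf.predict(K[np.ix_(test, train)])   # rows: test items, cols: training items
print("test accuracy:", (pred == y[test]).mean())
```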

Journal ArticleDOI
TL;DR: The improvement of PSORT II is reported from three aspects: the employment of mammalian (murine) data, the optimization of the learning method, and the optimization of the sequence features used.
Abstract: The PSORT system [8] is a unique tool for the prediction of protein subcellular localization in the sense that it can deal with proteins localized at almost all subcellular compartments. Among its several versions, PSORT II [5] was developed for the prediction of eukaryotic proteins using yeast sequences as its training data. The reason why data from a single species were used was that the training data should reflect the subcellular proportions of a proteome. However, since yeast is a unicellular organism, applying PSORT II to sequences of multicellular organisms can sometimes be problematic. For example, it has been pointed out that secretory proteins tend to be under-predicted. Since the first release of PSORT II, genome projects have been producing rich information on genes for many model organisms, including yeasts, nematode, mouse and human. Among them, the information on mouse genes is managed in the Mouse Genome Database [2] (MGD) and Mouse Genome Informatics (MGI). The information includes data on full-length cDNAs [7] with annotation of the subcellular localization sites of their products. In this work, we report the improvement of PSORT II from three aspects: the employment of mammalian (murine) data, the optimization of the learning method, and the optimization of the sequence features used.

Journal ArticleDOI
TL;DR: This first application of computer modeling to the development of the nematode C. elegans focuses on the cellular arrangement in early embryos, and finds that cell rounding and stiffening only during the period of cell division were effective in generating cellular arrangements almost identical to those in real embryos.
Abstract: The ultimate goal of bioinformatics is to reconstruct biological systems in a computer. Biological systems have a multi-scale and multi-level biological hierarchy. The cellular level of the hierarchy is appropriate and practicable for reconstructing biological systems by computer modeling. In our first application of computer modeling to the development of the nematode C. elegans, we focus on the cellular arrangement in early embryos. This plays a very important role in cell fate determination by cell-cell interaction, which is largely restricted by physical conditions. We have already constructed a computer model of a C. elegans embryo, currently up to the 4-cell stage, using deformable and dividable geometric graphics. The modeled components of the embryo are based solely on cellular-level dynamics. Here, we modeled the new physical phenomena of cell division, cell rounding and stiffening; we then combined them with the already modeled phenomena, contractile ring contraction and cell elongation. We investigated the effectiveness of the new model on cellular arrangement by computer simulations. We found that cell rounding and stiffening only during the period of cell division were effective in generating cellular arrangements almost identical to those in real embryos. Since cells could be soft during the period between cell divisions, implementation of the new model resulted in cell shapes similar to those of real embryos. The nature of the model and its relationship to real embryos are discussed.