Showing papers in "Genome Informatics in 1998"


Journal ArticleDOI
TL;DR: This work selected the most frequently seen verbs from raw text comprising 1 million words of Medline abstracts and identified (or bracketed) the noun phrases in the corpus with a precision rate of 90%.
Abstract: We have selected the most frequently seen verbs from raw text comprising 1 million words of Medline abstracts, and we were able to identify (or bracket) the noun phrases contained in the corpus with a precision rate of 90%. Then, based on the noun-phrase-bracketed corpus, we tried to find the subject and object terms for some frequently seen verbs in the domain. The precision rate of finding the right subject and object for each verb was about 73%. This task was only made possible because we were able to linguistically analyze (or parse) a large quantity of raw corpus text. Our approach will be useful for classifying genes and gene products and for identifying the interactions between them. It is the first step of our effort to build a genome-related thesaurus and hierarchies in a fully automatic way.

216 citations


Journal ArticleDOI
TL;DR: A program for the identification of gene symbols and names inside sentences has been devised, made up of a series of sieves of different natures, lexical, morphological and semantic, to distinguish among the words of a sentence those which can only be potential gene symbols or names.
Abstract: Gathering data on molecular interactions to be fed into a specialized database has motivated the development of a computer system to help extract pertinent information from texts, relying on advanced linguistic tools complemented by object-oriented knowledge modeling capabilities. As a first step toward this challenging objective, a program for the identification of gene symbols and names inside sentences has been devised. The main difficulty is that these names and symbols do not appear to follow construction rules. The program is thus made up of a series of sieves of different natures (lexical, morphological and semantic) to distinguish among the words of a sentence those which can only be potential gene symbols or names. Its performance has been evaluated, in terms of coverage and precision ratios, on a corpus of texts concerning D. melanogaster for which the list of names of known genes is available for checking.

162 citations


Journal ArticleDOI
TL;DR: The results from the two predictors suggest that disordered regions comprise a sequence-dependent category distinct from that of ordered protein structure.
Abstract: Using ordered and disordered regions identified either by X-ray crystallography or by NMR spectroscopy, we trained neural networks to predict order and disorder from amino acid sequence. Although the NMR-based predictor initially appeared to be much better than the one based on the X-ray data, both predictors yielded similar overall accuracies when tested on each other's training sets, and indicated similar regions of disorder in each sequence. The predictor trained with X-ray data showed similar results for a five-fold cross-validation experiment and for the out-of-sample predictions on the NMR-characterized data. In contrast, the predictor trained with NMR data gave substantially worse accuracies on the out-of-sample X-ray data than those obtained in the five-fold cross-validation during network training. Overall, the results from the two predictors suggest that disordered regions comprise a sequence-dependent category distinct from that of ordered protein structure.

147 citations


Journal ArticleDOI
TL;DR: A simulator of boolean networks without time delay is presented, which includes a genetic network identifier with a graphic interface that generates instructions for experiments of gene disruptions and overexpressions.
Abstract: A hot research topic in genomics is to analyze the interactions between genes by systematic gene disruptions and gene overexpressions. Based on a boolean network model without time delay, we have been investigating efficient strategies for identifying a genetic network by multiple gene disruptions and overexpressions. This paper first shows the relationship between our boolean network model without time delay and the standard synchronous boolean network model. Then we present a simulator of boolean networks without time delay for multiple gene disruptions and gene overexpressions, which includes a genetic network identifier with a graphic interface that generates instructions for experiments of gene disruptions and overexpressions.

139 citations
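
Illustrative sketch (not taken from the paper): the abstract above relates its model to the standard synchronous boolean network, so the Python below shows only that standard model, with a gene disruption treated as forcing a gene to 0 and an overexpression as forcing it to 1. The paper's own model without time delay and its network identifier are not reproduced here. The names `step`, `simulate`, and the three-gene `RULES` network are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable

State = Dict[str, bool]
Rule = Callable[[State], bool]


def clamp(state: State, disrupted: Iterable[str], overexpressed: Iterable[str]) -> State:
    """Force disrupted genes off and overexpressed genes on."""
    clamped = dict(state)
    for gene in disrupted:
        clamped[gene] = False
    for gene in overexpressed:
        clamped[gene] = True
    return clamped


def step(state: State, rules: Dict[str, Rule],
         disrupted: Iterable[str] = (), overexpressed: Iterable[str] = ()) -> State:
    """One synchronous update: every gene reads the same previous state."""
    updated = {gene: rule(state) for gene, rule in rules.items()}
    return clamp(updated, disrupted, overexpressed)


def simulate(initial: State, rules: Dict[str, Rule], steps: int,
             disrupted: Iterable[str] = (), overexpressed: Iterable[str] = ()) -> State:
    """Apply the perturbation to the initial state, then iterate `steps` updates."""
    disrupted, overexpressed = tuple(disrupted), tuple(overexpressed)
    state = clamp(initial, disrupted, overexpressed)
    for _ in range(steps):
        state = step(state, rules, disrupted, overexpressed)
    return state


# Hypothetical toy network: A sustains itself, A activates B, B represses C.
RULES: Dict[str, Rule] = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"],
    "C": lambda s: not s["B"],
}

if __name__ == "__main__":
    start = {"A": True, "B": False, "C": False}
    print("wild type:  ", simulate(start, RULES, 3))
    print("B disrupted:", simulate(start, RULES, 3, disrupted=["B"]))
```

Comparing the wild-type run with the run in which B is disrupted shows how a single perturbation exposes the repression of C by B, which is the kind of evidence an identification strategy would draw on.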


Journal ArticleDOI
TL;DR: The gyrB gene is chosen, because it is rarely transmitted horizontally, its molecular evolution rate is higher than that of 16S rRNA, and the gene is distributed ubiquitously among bacterial species.
Abstract: Nucleotide sequences of small-subunit rRNA (16S rRNA) are most commonly used for the identification and characterization of bacteria and their complex communities. However, 16S rRNA evolves slowly and is often not very convenient to resolve bacterial strains at the species level. We have therefore attempted to develop a rapid and more convenient system for bacterial identification using the gyrB gene sequences. We chose the gyrB gene, because (i) it is rarely transmitted horizontally, (ii) its molecular evolution rate is higher than that of 16S rRNA, and (iii) the gene is distributed ubiquitously among bacterial species. We PCR-amplified the 1.2 kb-long gyrB segments from about 1,000 bacterial species by using degenerate primers and determined their nucleotide sequences. The resultant data have been assembled into the gyrB database accessible via WWW.

97 citations


Journal ArticleDOI
TL;DR: Attributes based on cysteine, the aromatics, flexible tendencies, and charge were found to be the best attributes for distinguishing order and disorder among those tested so far.
Abstract: The conditional probability, P(s|x), is a statement of the probability that the event, s, will occur given prior knowledge for the value of x. If x is given and if s is randomly distributed, then an empirical approximation of the true conditional probability can be computed by the application of Bayes' Theorem. Here s represents one of two structural classes, either ordered, s (o), or disordered, s (d), and x represents an attribute value calculated over a window of 21 amino acids. Plots of P(s|x) versus x provide information about the correlation between the given sequence attribute and disorder or order. These conditional probability plots allow quantitative comparisons between individual attributes for their ability to discriminate between order and disorder states. Using such quantitative comparisons, 38 different sequence attributes have been rank-ordered. Attributes based on cysteine, the aromatics, flexible tendencies, and charge were found to be the best attributes for distinguishing order and disorder among those tested so far.

55 citations
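
Illustrative sketch (not taken from the paper): the abstract above describes estimating P(s|x) empirically through Bayes' Theorem and plotting it against the attribute value x. The Python below shows one way such an estimate could be computed from binned attribute values. The function name `posterior_disorder`, the bin edges, and the default prior of 0.5 for the disordered class are assumptions made for this example; the abstract does not state how the priors or bins were chosen.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count values per bin; bins are [edges[i], edges[i+1]), last bin closed."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        i = bisect_right(edges, v) - 1
        if v == edges[-1]:
            i = n_bins - 1  # include the right-most edge in the last bin
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts


def posterior_disorder(ordered: Sequence[float], disordered: Sequence[float],
                       edges: Sequence[float],
                       prior_disordered: float = 0.5) -> List[Optional[float]]:
    """Return P(disordered | x in bin) for each bin, or None for empty bins.

    Bayes' Theorem: P(d|x) = P(x|d) P(d) / (P(x|d) P(d) + P(x|o) P(o)).
    """
    if not ordered or not disordered:
        raise ValueError("both classes need at least one attribute value")
    prior_ordered = 1.0 - prior_disordered
    h_o, h_d = histogram(ordered, edges), histogram(disordered, edges)
    result: List[Optional[float]] = []
    for c_o, c_d in zip(h_o, h_d):
        like_o = c_o / len(ordered)      # empirical P(x in bin | ordered)
        like_d = c_d / len(disordered)   # empirical P(x in bin | disordered)
        evidence = like_d * prior_disordered + like_o * prior_ordered
        result.append(like_d * prior_disordered / evidence if evidence > 0 else None)
    return result


if __name__ == "__main__":
    ordered_x = [0.1, 0.2, 0.6]      # made-up attribute values, ordered windows
    disordered_x = [0.7, 0.8, 0.9]   # made-up attribute values, disordered windows
    print(posterior_disorder(ordered_x, disordered_x, edges=[0.0, 0.5, 1.0]))
```

An attribute whose per-bin posterior swings far from the prior across its range separates the two classes well, which is one plausible basis for ranking attributes against each other.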


Journal ArticleDOI
TL;DR: The PAPIA (PArallel Protein Information Analysis) system performs fast parallel processing for typical calculations in protein analysis, such as structure similarity search, sequence homology search and multiple sequence alignment, nearly 60 times faster than a single processor.
Abstract: Protein information analysis is widely regarded as a key technology in drug design, macromolecular engineering, and understanding genome sequences. Because a vast amount of calculation is required, further speed-up of protein information analysis is very much in demand. We have implemented the PAPIA (PArallel Protein Information Analysis) system on the RWC PC cluster IIa (PAPIA cluster), which consists of 64 Pentium Pro 200 MHz microprocessors. The PAPIA system performs fast parallel processing for typical calculations in protein analysis, such as structure similarity search, sequence homology search and multiple sequence alignment, nearly 60 times faster than a single processor. We have started a WWW service (http://www.rwcp.or.jp/papia/), allowing any biologist to easily submit jobs to the PAPIA system through a WWW browser. The user can experience the power of current parallel processing technology.

35 citations


Journal ArticleDOI
TL;DR: It is possible to detect DNA molecular changes such as deletions, additions, amplifications or DNA methylations occurring at or near the restriction enzyme cleavage sites by comparing large numbers of RLGS electrophoretograms, without any visual inspection or human interaction.
Abstract: We have developed fully automated algorithms for processing 2-D gel electrophoretograms based on the RLGS (restriction landmark genomic scanning) method: one for fully automated spot recognition from an RLGS electrophoretogram and another for fully automated pairwise matching of the spots found on such 2-D electrophoretograms. Without any human interaction, several thousand spots on a 2-D electrophoretogram, including hidden spots found at the shoulder of large spots, can be identified correctly by applying our spot recognition algorithm, except for only a few true-negative and false-positive spots. Once the locations and intensities of the landmark spots are correctly recognized automatically, our pairwise spot matching algorithm reliably and rapidly identifies equivalent pairs of spots found on the nonlinearly distorted RLGS electrophoretograms in a fully automatic way, i.e., the boring and annoying spot landmarking process is unnecessary. At the beginning of the spot matching process, the most suitable pair of corresponding spots is searched for automatically, and then the other equivalent pairs of spots are identified. With our powerful image processing algorithms, it is possible to detect DNA molecular changes such as deletions, additions, amplifications or DNA methylations occurring at or near the restriction enzyme cleavage sites by comparing large numbers of RLGS electrophoretograms, without any visual inspection or human interaction.

28 citations


Journal ArticleDOI
TL;DR: New powerful estimators utilizing k ≥ 3 dimensional sub-alignments are presented, and a new bounding technique is proposed using V(Δ), the set of vertices on paths whose lengths are at most Δ longer than the shortest path.
Abstract: The alignment problem for DNA or protein sequences is widely applicable and important in various fields of molecular biology. This problem can be reduced to the shortest path problem, and Ikeda and Imai (Genome Informatics 5: 90-99, 1994) showed that the A* algorithm works efficiently with the estimator utilizing all 2-dimensional sub-alignments. In this paper we present new powerful estimators utilizing k ≥ 3 dimensional sub-alignments, and propose a new bounding technique using V(Δ), the set of vertices on paths whose lengths are at most Δ longer than the shortest path. We also extend our algorithm to a recursive-estimate version. These algorithms become more efficient when the number of sequences increases or the similarity among sequences is lower.

24 citations


Journal ArticleDOI
TL;DR: This paper explains some of the current efforts to develop various NLP-based tools for information extraction from genome-related on-line documents.
Abstract: Huge quantities of on-line medical texts such as Medline are available, and we would hope to extract as much useful information as possible from these resources, preferably automatically, with the aid of computer technologies. In particular, recent advances in Natural Language Processing (NLP) techniques raise new challenges and opportunities for tackling genome-related on-line text; combining NLP techniques with genome informatics extends beyond the traditional realms of either technology to a variety of emerging applications. In this paper, we explain some of our current efforts to develop various NLP-based tools for information extraction from genome-related on-line documents.

22 citations


Journal ArticleDOI
TL;DR: It is shown that if there exists a weighted network which is consistent with the given data, it can be found in polynomial time, whereas the optimization version of the problem is NP-hard.
Abstract: We study the problem of finding a genetic network from data obtained by multiple gene disruptions and overexpressions. We define a genetic network as a weighted graph, and analyze the computational complexity of the problem. We show that if there exists a weighted network which is consistent with the given data, we can find it in polynomial time. Moreover, we also consider the optimization problem, where we try to find a weighted network that is optimally consistent with the given data. We show that the problem is NP-hard. On the other hand, we give a polynomial-time approximation algorithm to solve it with approximation ratio 2. We report some simulation results on experiments.

Journal ArticleDOI
TL;DR: A genome-scale system for predicting gene functions and cellular functions is described, based on sequence similarity, positional correlations in the genome, and the ortholog group tables in KEGG, together with its application to the complete genome of Pyrococcus horikoshii to identify ABC transporters.
Abstract: In order to fully make use of the vast amount of information in the complete genome sequences, we are developing a genome-scale system for predicting gene functions and cellular functions. The system makes use of the information of sequence similarity, the information of positional correlations in the genome, and the reference knowledge stored as the ortholog group tables in KEGG (Kyoto Encyclopedia of Genes and Genomes). The ortholog group table summarizes orthologous and paralogous relations among different organisms for a set of genes that are considered to form a functional unit, such as a conserved portion of the metabolic pathway or a molecular machinery for the membrane transport. At the moment, the ortholog group table is constructed for the cases where the genes are clustered in physically close positions in the genome for at least one organism. In this paper, we describe the system and the actual analysis of the complete genome of Pyrococcus horikoshii to identify ABC transporters.

Journal ArticleDOI
TL;DR: This paper aims to demonstrate the efforts towards in-situ applicability of EMMARM, which aims to provide real-time information about the physical properties of EMTs and their application in the environment.
Abstract: a95550@eie.yz.yamagata-u.ac.jp, carlos@translell.eco.tut.ac.jp, tikemura@ddbj.nig.ac.jp. (1) Department of Electrical Information Engineering, Faculty of Engineering, Yamagata University, Yonezawa, Yamagata 992-8510, Japan; (2) Department of Ecological Engineering, Faculty of Engineering, Toyohashi University of Technology, Toyohashi, Aichi 441-8580, Japan; (3) Department of Population Genetics, National Institute of Genetics, and the Graduate University for Advanced Studies, Mishima, Shizuoka, 441-8540, Japan; (4) Department of Developmental Genetics, National Institute of Genetics; (5) CREST, JST (Japan Science and Technology).

Journal ArticleDOI
TL;DR: MUSCA is a two-stage approach to the alignment problem that identifies two relatively simpler sub-problems whose solutions are used to obtain the alignment of the sequences, and it introduces the notion of an alignment number K (2
Abstract: Given a set of N sequences, the Multiple Sequence Alignment problem is to align these N sequences, possibly with gaps, in a way that brings out the best commonality of the N sequences. MUSCA is a two-stage approach to the alignment problem that identifies two relatively simpler sub-problems whose solutions are used to obtain the alignment of the sequences. We first discover motifs in the N sequences and then extract an appropriate subset of compatible motifs to obtain a good alignment. The motifs of interest to us are the irredundant motifs, whose number is only polynomial in the input size. In practice, however, the number is much smaller (sub-linear). Notice that this step aids in a direct N-wise alignment, as opposed to composing the alignments from lower-order (say pairwise) alignments, and the solution is also independent of the order of the input sequences; hence the algorithm works very well when dealing with a large number of sequences. The second part of the problem, which deals with obtaining a good alignment, is solved using a graph-theoretic approach that computes an induced subgraph satisfying certain simple constraints. We reduce a version of this problem to that of solving an instance of a set covering problem, thus offering the best possible approximate solution to the problem (provided P ≠ NP). Our experimental results, while preliminary, indicate that this approach is efficient, particularly on large numbers of long sequences, and gives good alignments when tested on biological data such as DNA and protein sequences. We introduce the notion of an alignment number K (2

Journal ArticleDOI
TL;DR: The rough reading strategy which combines the experts' knowledge with the machine learning system BONSAI is proposed and an algorithm is devised which iterates the above procedure until almost all records of experts' interest are selected.
Abstract: We consider the problem of selecting the articles of interest to experts from a literature database with the assistance of a machine learning system. For this purpose, we propose the rough reading strategy, which combines the experts' knowledge with the machine learning system. To the articles converted through the rough reading strategy, we apply the learning system BONSAI to discover rules which may reduce the work of experts in selecting the articles. Furthermore, we devise an algorithm which iterates the above procedure until almost all records of interest to the experts are selected. Experimental results using articles from Cell show that almost all records of interest to the experts are selected while the work of the experts is reduced drastically.

Journal ArticleDOI
TL;DR: This work examines the non-local processes responsible for genome rearrangements such as inversion of arbitrarily long segments of chromosomes, and calculates the invariants for this process for N=5, and applies them to mitochondrial genome data from coelomate metazoans, showing how they resolve key aspects of branching order.
Abstract: The method of phylogenetic invariants was developed to apply to aligned sequence data generated, according to a stochastic substitution model, for N species related through an unknown phylogenetic tree. The invariants are functions of the probabilities of the observable N-tuples, which are identically zero, over all choices of branch length, for some trees. Evaluating the invariants associated with all possible trees, using observed N-tuple frequencies over all sequence positions, enables us to rapidly infer the generating tree. An aspect of evolution at the genomic level much studied recently is the rearrangements of gene order along the chromosome from one species to another. Instead of the substitutions responsible for sequence evolution, we examine the non-local processes responsible for genome rearrangements such as inversion of arbitrarily long segments of chromosomes. By treating the potential adjacency of each possible pair of genes as a "position", an appropriate "substitution" model can be recognized as governing the rearrangement process, and a probabilistically principled phylogenetic inference can be set up. We calculate the invariants for this process for N=5, and apply them to mitochondrial genome data from coelomate metazoans, showing how they resolve key aspects of branching order.

Journal ArticleDOI
TL;DR: The results indicate that, in eubacteria including two species of Mycoplasma, the operon structure of ribosomal protein genes is well conserved, while their relative orientation and chromosomal location are diverged into several classes.
Abstract: The complete genomic nucleotide sequence data of more than 10 unicellular organisms have become available. Over the past years, we have been focusing our attention on the analysis of the structure and function of the ribosome and its protein components. By making use of the genomic sequence data, our work can now be extended to comparative analysis of the ribosomal components at the genomic level. Such analysis will contribute to our understanding of the structure-function relationship of the ribosome, which is vital to the expression of genetic information. With this in mind, we analyzed the ribosomal protein genes of organisms whose genomic sequence data are available, which included Aquifex aeolicus, Archaeoglobus fulgidus, Borrelia burgdorferi, Bacillus subtilis, Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Methanococcus jannaschii, Mycoplasma genitalium, Mycoplasma pneumoniae, Synechocystis sp., and Saccharomyces cerevisiae. In addition, the amino acid sequence data of Bacillus stearothermophilus ribosomal proteins were used in the evolutionary evaluation. The results indicate that, in eubacteria including two species of Mycoplasma, the operon structure of ribosomal protein genes is well conserved, while their relative orientation and chromosomal location have diverged into several classes. The operon structure in M. jannaschii, on the other hand, is quite different from the eubacterial one, and we noticed that many of its genes show similarity to rat ribosomal protein genes. The degrees of sequence conservation differ from one ribosomal protein gene to another, but several genes encoding proteins that are considered to be of structural importance are conserved throughout the bacterial species including archaebacteria and further in S. cerevisiae.

Journal ArticleDOI
TL;DR: 3DinSight is an integrated database, search and visualization tool for structure, function and property of biomolecules, developed to help researchers to get insight into their relationship.
Abstract: In order to understand many biological phenomena, it is critical to gain insight into the relationship among structure, function and property of biomolecules. However, it is usually difficult to infer the relation from individual data, since we usually need to examine several databases and the literature to obtain the necessary information. It would be useful to have an integrated database where one can examine the relationship among structure, function and property. There are some services available on the Internet that link various databases, but the relational information on biomolecular structure, function and property is rather scarce. 3DinSight is an integrated database, search and visualization tool for structure, function and property of biomolecules, developed to help researchers gain insight into their relationship [1]. Various kinds of searches can be carried out through WWW interfaces. The locations of motif sequences and mutations are automatically mapped onto the structure and visualized by interactive viewers, VRML (Virtual Reality Modeling Language) and RasMol, where the mapped 3D objects are hyper-linked to the corresponding document data. Also, the thermodynamic data of proteins and mutants are integrated into 3DinSight. The amino-acid properties of a molecule, together with structural and functional information, can be displayed as a graph plot. 3DinSight is freely accessible through the Internet (http://www.rtc.riken.go.jp/3DinSight.html).

Journal ArticleDOI
TL;DR: This work uses the modern data integration system, Kleisli, to bring out annotated features of BLASTP results and strengthens the solution by incorporating additional information from SEG, ClustalW, hmmPfam, etc.
Abstract: BLASTP gives a good overall indication of what function a protein might have. However, analysis of BLASTP reports to discover various domain features in the protein is still tedious. We address this problem by using the modern data integration system, Kleisli, to bring out annotated features of BLASTP results. We further strengthen our solution by incorporating additional information from SEG, ClustalW, hmmPfam, etc. It is also noteworthy that the code of our implementation is short enough to be presented in its entirety.

Journal ArticleDOI
TL;DR: This work introduces a new concept in prediction of RNA structures, and extends the hitherto existing secondary structure prediction systems into the next step i.e. the prediction of the tertiary structure of the macromolecule from the predicted secondary structure.
Abstract: Several attempts to automatically predict RNA secondary structure have been made in recent years [1,2]. These attempts can be divided into essentially two general approaches. The first involves overall free energy minimization by adding contributions from each base pair, bulged base, loop and other elements [1]. The second type of approach is more empirical and involves searching for the combination of non-exclusive helices with a maximum number of base pairings [2]. Within the latter, methods using DP (dynamic programming) are the most common [2,3]. Here we introduce a new concept in the prediction of RNA structures, and we extend the existing secondary structure prediction systems to the next step, i.e. the prediction of the tertiary structure of the macromolecule from the predicted secondary structure. This will allow the identification of receptor regions on the molecule as well as detailed evaluation of its biochemical and biological functions.

Journal ArticleDOI
TL;DR: This poster proposes a system named AIGNET (Algorithms for Inference of Genetic Networks), and introduces two top down approaches for inference of genetic networks, which rely on the analysis of state changes and/or temporal responses of gene expression patterns.
Abstract: Powerful new technologies, such as DNA microarrays, provide simple and economical ways to explore gene expression patterns on a genomic scale [1, 2]. Using comprehensive gene expression data, various approaches are planned to infer genetic networks [3, 4]. In this poster, we propose a system named AIGNET (Algorithms for Inference of Genetic Networks), and introduce two top-down approaches for inference of genetic networks, which rely on the analysis of state changes and/or temporal responses of gene expression patterns. We show the strategy is flexible and rich in structure.

Journal ArticleDOI
TL;DR: The model is constructed using the E-CELL system, a generic software package for simulation of cellular processes; kinetic parameters of all the reactions are based on experimental data in the literature, and the behavior of the simulated model cell is compared with that of the real red blood cell observed in laboratories.
Abstract: In this work, we try to model and simulate the human red blood cell using the E-CELL system, a generic software package for simulation of cellular processes [1]. The human red blood cell has been well studied over the last three decades, and extensive biochemical data on its enzymes and metabolites have been accumulated [2]. The cell takes up glucose from the environment and processes it through the glycolysis pathway, generating ATP molecules for other cellular metabolism. The ATP molecules are consumed mostly for cation transport in order to maintain the cell's electroneutrality and osmotic balance. The cell also has several other pathways such as nucleotide metabolism and the pentose phosphate pathway (Fig. 1). The model we have constructed using E-CELL contains 44 reactions and 43 intermediates. Kinetics of most of the enzymatic reactions are modeled using various types of Michaelis-Menten equations (Phosphoglucoisomerase, Triose phosphate isomerase, Adenine phosphoribosyl transferase, etc.). Kinetics of other enzymatic reactions such as 6-Phosphogluconate dehydrogenase, Glutathione reductase and Transaldolase are modeled as Ordered Bi-Ter and Ping-Pong Bi-Bi systems, respectively. Other reactions use kinetic equations specific to the reaction (e.g., Hexokinase, Pyruvate kinase, etc.), nine of which are membrane transport reactions. Kinetic parameters of all the reactions are based on experimental data in the literature [2, 4]. We then compare the behavior of the simulated model cell with that of the real red blood cell observed in laboratory experiments. We also compare our model with other computer models [3, 4, 5, 6, 7].

Journal ArticleDOI
TL;DR: This work systematically analyzed all overlapping genes in the genomes of two closely related species, Mycoplasma genitalium and Mycoplasma pneumoniae, with careful comparisons of the homologous genes that overlap in one species but not in the other.
Abstract: Many overlapping genes have been identified in the genomes of procaryotes, bacteriophages, animal viruses, and mitochondria, some of which have been reported to have functional roles [1], but their evolutionary origin is not clearly understood. We systematically analyzed all overlapping genes in the genomes of two closely related species, Mycoplasma genitalium [2] and Mycoplasma pneumoniae [3]. In particular, careful comparisons were made for the homologous genes that overlap in one species but not in the other.

Journal ArticleDOI
TL;DR: This work has successfully constructed a virtual cell with 127 genes sufficient for “self-support”, selected from the genome of Mycoplasma genitalium, the organism having the smallest known genome.
Abstract: The E-CELL project [1] was launched in 1996 at Keio University in order to model and simulate various cellular processes with the ultimate goal of simulating the cell as a whole. The first version of the E-CELL simulation system, which is a generic software package for cell modeling, was completed in 1997. The E-CELL system enables us to model not only metabolic pathways but also other higher-order cellular processes such as protein synthesis and membrane transport within the same framework. These various processes can then be integrated into a single simulation model. Using the E-CELL system, we have successfully constructed a virtual cell with 127 genes sufficient for “self-support”. The gene set was selected from the genome of Mycoplasma genitalium, the organism having the smallest known genome. The set includes genes for transcription, translation, the glycolysis pathway for energy production, membrane transport, and the phospholipid biosynthesis pathway for membrane structure.

Journal ArticleDOI
TL;DR: It is shown that short and clear programs can be written in Kleisli, using its high-level query language CPL, to build a TPR domain hunter by integrating WU-BLAST2.0, HMMER, Entrez, and PFAM.
Abstract: We have two objectives. First, we want to build a system for detecting tetratricopeptide repeats in protein sequences. Second, we want to demonstrate how the general bioinformatics database integration system called Kleisli can help build such a system easily. We achieve these two objectives by showing that short and clear programs can be written in Kleisli, using its high-level query language CPL, to build a TPR domain hunter by integrating WU-BLAST2.0, HMMER, Entrez, and PFAM.

Journal ArticleDOI
TL;DR: PTFD is able to identify promoters sandwiched between other sequences whether these are coding sequences, or non coding sequences between coding sequences with no known promoter activity, and is comparable to the kind of results obtained with Neural Network and Hidden Markov Model predictions.
Abstract: A simple, fast and sensitive model has been developed for identifying and predicting sequences which have non-random statistical properties and are therefore biologically active, such as promoters. This statistical model, Penalized Triplet Frequency Distribution (PTFD), utilizes the information content of promoters (in triplets) and that of another set of sequences of a different category, e.g. coding sequences (also in triplets), to generate a hash table of scores for each of the 64 possible triplets. The hash table is unique for each set of promoter and non-promoter sequences but generally similar in composition. The cumulative score, and therefore the performance, of each sequence is assessed by (a) opening a 3 bp window and moving along the sequence one bp at a time to extract all the triplets; (b) obtaining each triplet's corresponding hash table value; and (c) summing up all the hash table values of the triplets found in the sequence. A cut-off value obtained by implementing the model on test promoter sequences is used to distinguish promoters from non-promoters. Our prediction results using the PTFD method are consistently around 93% True Positives (TP) and 10-14% False Positives (FP). These results are comparable to those obtained with Neural Network and Hidden Markov Model predictions of promoters from non-promoters (results not shown). In addition, our method is able to identify promoters sandwiched between other sequences, whether these are coding sequences or non-coding sequences between coding sequences with no known promoter activity.
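
Illustrative sketch (not taken from the paper): steps (a) to (c) above can be written as a sliding 3 bp window that looks up each triplet in a score table and sums the values. The abstract does not give the penalized scoring formula, so the table below uses a log ratio of pseudocount-smoothed triplet frequencies purely as a stand-in assumption; the function names, the pseudocount, and the toy sequences are likewise hypothetical.

```python
from collections import Counter
from itertools import product
from math import log
from typing import Dict, Iterable, List

# All 64 possible DNA triplets.
TRIPLETS: List[str] = ["".join(p) for p in product("ACGT", repeat=3)]


def triplets(seq: str) -> List[str]:
    """Step (a): slide a 3 bp window along the sequence one base at a time."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def triplet_frequencies(seqs: Iterable[str], pseudocount: float) -> Dict[str, float]:
    """Smoothed relative frequency of each of the 64 triplets in a sequence set."""
    counts = Counter(t for s in seqs for t in triplets(s))
    total = sum(counts[t] for t in TRIPLETS) + pseudocount * len(TRIPLETS)
    return {t: (counts[t] + pseudocount) / total for t in TRIPLETS}


def build_score_table(promoters: Iterable[str], non_promoters: Iterable[str],
                      pseudocount: float = 1.0) -> Dict[str, float]:
    """Score table over 64 triplets (ASSUMED log-ratio scoring, not the paper's)."""
    f_pos = triplet_frequencies(promoters, pseudocount)
    f_neg = triplet_frequencies(non_promoters, pseudocount)
    return {t: log(f_pos[t] / f_neg[t]) for t in TRIPLETS}


def cumulative_score(seq: str, table: Dict[str, float]) -> float:
    """Steps (b) and (c): look up each triplet and sum the table values.

    Triplets with non-ACGT characters are absent from the table and score 0.
    """
    return sum(table.get(t, 0.0) for t in triplets(seq))


def is_promoter(seq: str, table: Dict[str, float], cutoff: float) -> bool:
    """Call a sequence a promoter when its cumulative score reaches the cut-off."""
    return cumulative_score(seq, table) >= cutoff


if __name__ == "__main__":
    table = build_score_table(["TATAAATATA"], ["GCGCGCGCGC"])  # toy training sets
    for s in ("TATATA", "GCGCGC"):
        print(s, round(cumulative_score(s, table), 3), is_promoter(s, table, cutoff=0.0))
```

In practice the cut-off would be calibrated on held-out promoter sequences, as the abstract describes, instead of being fixed at zero as in this toy run.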

Journal ArticleDOI
TL;DR: This paper focuses on how the GENES database is constructed and how gene annotations are made to maintain consistent information among species.
Abstract: The KEGG (Kyoto Encyclopedia of Genes and Genomes) project is accumulating and computerizing vast knowledge of biochemistry and molecular biology. In addition to the PATHWAY database, the GENES database is a basic element of the KEGG system. It describes gene IDs, gene names, product names, chromosomal positions, gene classifications, EC numbers, codon frequencies, amino acid sequences and nucleotide sequences of over 20 species mainly with complete genomes. Since the GENES database provides links between genomes and pathways, it can be an important tool for functional genomics, where information of both genomes and pathways should be utilized. It can also be used for comparative genomics, where whole genomes from several species are analyzed at a time. A major problem in such analyses is that there is no standard nomenclature for representing genes and gene products. The names of the genes and gene products are often different between species even if the products have the same function. The discrepancies are conspicuous when the genomes are sequenced by different organizations. The construction of the GENES database is an effort to fill the gap and to provide a standardized resource for functional and comparative genomics. Last year, we reported main concepts of the GENES database [4]. This paper focuses on how the GENES database is constructed and how gene annotations are made to maintain consistent information among species.

Journal ArticleDOI
TL;DR: A web interface for DDGEL is newly developed, using which DDGEL can be easily accessed from standard PCs using standard web browsers such as NETSCAPE and INTERNET EXPLORER.
Abstract: Recently, a method called RLGS (Restriction Landmark Genomic Scanning) has been developed in order to detect and analyze the genetic alterations by observing the entire genomic DNA after separating DNA fragments in a single two-dimensional slab gel [1]. To analyze gel images obtained by the RLGS method, a lot of tasks must be done: thousands of spots must be detected where each spot corresponds to particular genetic landmark; a correspondence of spots between two images must be detected; links between spots and genetic information must be classified and stored in a database. We have been developing a software tool named DDGEL so that such tasks can be done automatically or semi-automatically [2], where similar systems have been developed by other groups [3, 4]. Previously, user interface of DDGEL was implemented using the X-WINDOW (X11R6) system. However, it is hard to install X-WINDOW on standard PCs (personal computers) and thus limited users (i.e., users having UNIX workstations) can use DDGEL. In order to make DDGEL more user-friendly, we have newly developed a web interface for DDGEL, using which DDGEL can be easily accessed from standard PCs using standard web browsers such as NETSCAPE and INTERNET EXPLORER.

Journal ArticleDOI
TL;DR: A new library of ligand binding sites is reported, which contains information of interacting atomic pairs between proteins and ligands, and other information such as indirect interaction caused by water molecules or metal ions.
Abstract: Under the Human Genome Project, all 100,000 genes of the human genome will be sequenced along with bacterial and other genomes. This vast amount of information will become essential to discover new enzyme-substrate interactions or to design new ligands that can be used as drugs and in other applications. The elucidation of protein structures lags behind the determination of protein sequences. Homology modeling and other efficient techniques can rapidly produce approximate 3-D structures of target proteins, but they cannot readily be used for drug design since the resulting 3-D structure models are imprecise. Combinatorial libraries of chemical compounds are useful as practical screening methods, but it is necessary to develop novel computational methods for understanding principles of molecular varieties. For small molecules that interact with proteins, such as GTP and NAD, their binding sites are usually well characterized and defined. Thus, the collection and organization of local structural knowledge, including the information of ligand binding sites of proteins and atomic locations of ligands, should be a valuable resource that could be used computationally for drug design or in other applications. We report here a new library of ligand binding sites, which contains information of interacting atomic pairs between proteins and ligands, and other information such as indirect interaction caused by water molecules or metal ions.

Journal ArticleDOI
TL;DR: A new addition of the ortholog/paralog group table for PTS is presented, which contains a large number of fusion proteins that make it difficult to cluster genes for functional grouping and to perform multiple alignment for identifying functional residues.
Abstract: The identification of the functions for all proteins in the genome is important to understand the features of the organism. With the increasing amount of complete genome sequences, the approaches based on comparative genomics have become particularly useful. There are attempts to automatically construct the ortholog table containing orthologous relations of genes in different organisms, but the complexity of organisms often requires manual efforts to add or remove specific cases and to improve the quality of data. In order to facilitate the process of constructing ortholog tables we have been focusing on specific aspects of protein functions [1, 2] rather than trying to cover the entire spectrum. Here we present a new addition of the ortholog/paralog group table for PTS. The phosphoenolpyruvate:carbohydrate phosphotransferase systems (PTSs) are both transport and sensing systems in gram-negative and gram-positive bacteria. They take in and phosphorylate a large number of carbohydrates, and act as signal transducers for movement toward these carbon sources. In general they have unique gene structures in the genome. The PTS comprises 5 or 6 components (EI, HPr, EIIA, EIIB, EIIC, EIID) where EI and HPr are common components and there are multiple components (paralogs) of EIIs depending on substrates. The EII components often form gene clusters in the genome and some of them (EIIA, EIIB, EIIC) sometimes fuse into one protein. There are many variations in the order of the components and the fusion protein in the gene cluster. In contrast to the bacterial ABC transport system [1], the PTS contains a large number of "fusion of components" that makes it difficult to cluster genes for functional grouping and to perform multiple alignment for identifying functional residues. In order to cope with this difficulty, the fusion proteins were divided into the components by the homology search against our collection of known components.