Journal, ISSN: 0919-9454

Genome Informatics 

Imperial College Press
About: Genome Informatics is an academic journal. It publishes mainly in the areas of Genome and Gene, and its ISSN is 0919-9454. Over its lifetime, the journal has published 1,517 papers, which have received 15,503 citations.


Papers
Journal Article
TL;DR: An improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix was created, and Python and Perl interfaces to the C Clustering Library were generated, thereby combining the flexibility of a scripting language with the speed of C.
Abstract: Summary: We have implemented k-means clustering, hierarchical clustering and self-organizing maps in a single multipurpose open-source library of C routines, callable from other C and C++ programs. Using this library, we have created an improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix. In addition, we generated a Python and a Perl interface to the C Clustering Library, thereby combining the flexibility of a scripting language with the speed of C. Availability: The C Clustering Library and the corresponding Python C extension module Pycluster were released under the Python License, while the Perl module Algorithm::Cluster was released under the Artistic License. The GUI code Cluster 3.0 for Windows, Macintosh and Linux/Unix, as well as the corresponding command-line program, were released under the same license as the original Cluster code. The complete source code is available at http://bonsai.ims.u-tokyo.ac.jp/mdehoon/software/cluster. Alternatively, Algorithm::Cluster can be downloaded from CPAN, while Pycluster is also available as part of the Biopython distribution.

1,493 citations
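
For readers unfamiliar with the k-means clustering named in the abstract above, a minimal pure-Python sketch of the standard Lloyd iteration follows. It is an illustration only: it is not the C Clustering Library's code and does not reproduce the Pycluster or Algorithm::Cluster interfaces. The function name `kmeans` and its parameters are hypothetical.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster equal-length numeric tuples with Lloyd's algorithm.

    Hypothetical helper for illustration; returns (assignment, centroids).
    """
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the iteration has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids
```

Calling `kmeans(points, 2)` on two well-separated groups of points should place each group in its own cluster; which cluster gets which index depends on the random start.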

Journal Article
TL;DR: Overall, intrinsic disorder appears to be common, with eucaryotes perhaps having a higher percentage of native disorder than archaea or bacteria. Estimates of wholly disordered proteins ranged from 1% to 8% in bacteria and from 2% to 11% in archaea, plus an apparently anomalous 18% among the archaea.
Abstract: Intrinsic protein disorder refers to segments or to whole proteins that fail to fold completely on their own. Here we predicted disorder on protein sequences from 34 genomes, including 22 bacteria, 7 archaea, and 5 eucaryotes. Predicted disordered segments ≥ 50, ≥ 40, and ≥ 30 in length were determined as well as proteins estimated to be wholly disordered. The five eucaryotes were separated from bacteria and archaea by having the highest percentages of sequences predicted to have disordered segments ≥ 50 in length: from 25% for Plasmodium to 41% for Drosophila. Estimates of wholly disordered proteins in the bacteria ranged from 1% to 8%, averaging to 3 ± 2%; estimates in various archaea ranged from 2 to 11%, plus an apparently anomalous 18%, averaging to 7 ± 5%, which drops to 5 ± 3% if the high value is discarded. Estimates in the 5 eucarya ranged from 3 to 17%. The putative wholly disordered proteins were often ribosomal proteins, but in addition about equal numbers were of known and unknown function. Overall, intrinsic disorder appears to be common, with eucaryotes perhaps having a higher percentage of native disorder than archaea or bacteria.

642 citations
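
The per-genome percentages in the abstract above depend on finding sequences whose longest predicted disordered segment reaches a length threshold. A minimal sketch of that bookkeeping step follows, assuming each sequence has already been reduced to a list of per-residue booleans (True for predicted disorder). The predictor itself is not described here, and the function names are hypothetical.

```python
from itertools import groupby


def longest_disordered_run(flags):
    """Length of the longest consecutive run of True values in flags."""
    return max(
        (sum(1 for _ in run) for is_disordered, run in groupby(flags) if is_disordered),
        default=0,
    )


def percent_with_long_disorder(predictions, min_length=50):
    """Percentage of sequences with a disordered run of at least min_length.

    predictions is a list of per-residue boolean lists, one per sequence
    (an assumed input format, not one taken from the paper).
    """
    if not predictions:
        return 0.0
    hits = sum(
        1 for flags in predictions if longest_disordered_run(flags) >= min_length
    )
    return 100.0 * hits / len(predictions)
```

Running the second function with `min_length` set to 50, 40 and 30 in turn would give the three thresholds mentioned in the abstract.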

Journal Article
Li X, Pedro Romero, M Rani, A. K. Dunker, Zoran Obradovic
TL;DR: Logistic regression, discriminant analysis, and neural networks were used to predict ordered and disordered regions in proteins to support the hypothesis that disorder is encoded by the amino acid sequence.
Abstract: Logistic regression (LR), discriminant analysis (DA), and neural networks (NN) were used to predict ordered and disordered regions in proteins. Training data were from a set of non-redundant X-ray crystal structures, with the data being partitioned into N-terminal (N), C-terminal (C) and internal (I) regions. The DA and LR methods gave almost identical 5-fold cross-validation accuracies that averaged to the following values: 75.9 ± 3.1% (N-regions), 70.7 ± 1.5% (I-regions), and 74.6 ± 4.4% (C-regions). NN predictions gave slightly higher scores: 78.8 ± 1.2% (N-regions), 72.5 ± 1.2% (I-regions), and 75.3 ± 3.3% (C-regions). Predictions improved with length of the disordered regions. Averaged over the three methods, values ranged from 52% to 78% for length = 9-14 to ≥ 21, respectively, for I-regions, from 72% to 81% for length = 5 to 12-15, respectively, for N-regions, and from 70% to 80% for length = 5 to 12-15, respectively, for C-regions. These data support the hypothesis that disorder is encoded by the amino acid sequence.

540 citations
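
The cross-validated accuracies in the abstract above are reported as an average with a ± spread. The sketch below shows one common way such figures are produced: split the samples into five folds, score each held-out fold, then summarize. It assumes the ± value is a sample standard deviation across folds, which the abstract does not state. The function names are hypothetical, and no classifier is included.

```python
import statistics


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous, nearly equal test folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def summarize(accuracies):
    """Return (mean, sample standard deviation) of per-fold accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

In practice the samples would usually be shuffled before splitting. Each fold serves once as the test set while the remaining folds train the model.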

Journal Article
TL;DR: This work presents a comparative study on six feature selection heuristics by applying them to two sets of data, which are gene expression profiles from Acute Lymphoblastic Leukemia and proteomic patterns from ovarian cancer patients.
Abstract: Feature selection plays an important role in classification. We present a comparative study on six feature selection heuristics by applying them to two sets of data. The first set of data are gene expression profiles from Acute Lymphoblastic Leukemia (ALL) patients. The second set of data are proteomic patterns from ovarian cancer patients. Based on features chosen by these methods, error rates of several classification algorithms were obtained for analysis. Our results demonstrate the importance of feature selection in accurately classifying new samples.

455 citations

Journal Article
TL;DR: The largest alignments of amino acid sequence data to date are constructed and a good case is made for the tree shrew as a closer relative of primates than rodents, while also showing a slower rate of evolution in key cell cycle genes.
Abstract: A major effort is being undertaken to sequence an array of mammalian genomes. Coincidentally, the evolutionary relationships of the 18 presently recognized orders of placental mammals are only just being resolved. In this work we construct and analyse the largest alignments of amino acid sequence data to date. Our findings allow us to set up a series of superordinal groups (clades) to act as prior hypotheses for further testing. Important findings include strong evidence for a clade of Euarchonta+Glires (=Supraprimates) comprised of primates, flying lemurs, tree shrews, lagomorphs and rodents. In addition, there is good evidence for a clade of all placental mammals except Xenarthra and Afrotheria (=Boreotheria) and for the previously recognised clades Laurasiatheria, Scrotifera, Fereuungulata, Ferae, Afrotheria, Euarchonta, Glires, and Eulipotyphla. Accordingly, a revised classification of the placental mammals is put forward. Using this and molecular divergence-time methods, the ages of the superordinal splits are estimated. While results are strongly consistent with the earliest superordinal divergences all being > 65 mybp (Cretaceous period), they suffer from greater uncertainty than presently appreciated. The early primate split of tarsiers from the anthropoid lineage at ∼55 mybp is seen to be an especially informative fossil calibration point. A statistical framework for testing clades using SINE data is presented and reveals significant support for the tarsier/anthropoid clade, as well as the clades Cetruminantia and Whippomorpha. Results also underline our thesis that while sequence analysis can help set up hypothesised clades, SINEs obtainable from sequencing 1-2 MB regions of placental genomes are essential to testing them. In contrast, derivations suggest that empirical Bayesian methods for sequence data may not be robust estimators of clades. 
Our findings, including the study of genes such as TP53, make a good case for the tree shrew as a closer relative of primates than rodents, while also showing a slower rate of evolution in key cell cycle genes. Tree shrews are consequently high value experimental animals and a strong candidate for a genome sequencing initiative.

273 citations

Network Information
Related Journals (5)
Bioinformatics: 17.4K papers, 2.1M citations, 86% related
BMC Bioinformatics: 11.9K papers, 642K citations, 86% related
Nucleic Acids Research: 48.8K papers, 4.7M citations, 83% related
Proteins: 8K papers, 447.3K citations, 83% related
Genome Research: 5.5K papers, 931.7K citations, 80% related
Performance
Metrics
No. of papers from the Journal in previous years
Year    Papers
2014    1
2011    5
2010    5
2009    2
2008    47
2007    48