
Showing papers in "Briefings in Bioinformatics in 2004"


Journal ArticleDOI
TL;DR: An overview of the statistical methods, computational tools, and visual exploration modules for data input and the results obtainable in MEGA is provided.
Abstract: With its theoretical basis firmly established in molecular evolutionary and population genetics, comparative DNA and protein sequence analysis plays a central role in reconstructing the evolutionary histories of species and multigene families, estimating rates of molecular evolution, and inferring the nature and extent of selective forces shaping the evolution of genes and genomes. The scope of these investigations has now expanded greatly owing to the development of high-throughput sequencing techniques and novel statistical and computational methods. These methods require easy-to-use computer programs. One such effort has been to produce Molecular Evolutionary Genetics Analysis (MEGA) software, with its focus on facilitating the exploration and analysis of DNA and protein sequence variation from an evolutionary perspective. Currently in its third major release, MEGA3 contains facilities for automatic and manual sequence alignment, web-based mining of databases, inference of phylogenetic trees, estimation of evolutionary distances and testing of evolutionary hypotheses. This paper provides an overview of the statistical methods, computational tools, and visual exploration modules for data input and the results obtainable in MEGA.
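As an illustration of the simplest kind of evolutionary distance such a package estimates, here is a minimal sketch of the Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p); this is the textbook formula in Python, not MEGA's implementation:

```python
# Minimal sketch of one evolutionary distance MEGA offers: the
# Jukes-Cantor correction, where p is the proportion of sites that
# differ between two aligned sequences.
from math import log

def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of sites that differ between two aligned sequences."""
    assert len(seq1) == len(seq2), "sequences must be aligned"
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)

def jukes_cantor(seq1: str, seq2: str) -> float:
    """Jukes-Cantor distance: corrects p for multiple substitutions."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        raise ValueError("p >= 0.75: distance undefined under Jukes-Cantor")
    return -0.75 * log(1 - (4.0 / 3.0) * p)

print(jukes_cantor("ACGTACGTAC", "ACGTTCGTAA"))  # p = 0.2 -> d ~ 0.233
```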

12,124 citations


Journal ArticleDOI
TL;DR: Some of the aspects of Swiss-Prot that make it unique are described, the developments necessary for the database to continue to play its role as a focal point of protein knowledge are explained, and advice pertinent to the development of high-quality knowledge resources is provided.
Abstract: We describe some of the aspects of Swiss-Prot that make it unique, explain what developments we believe are necessary for the database to continue to play its role as a focal point of protein knowledge, and provide advice pertinent to the development of high-quality knowledge resources on one aspect or another of the life sciences.

372 citations


Journal ArticleDOI
TL;DR: This tool can save time and enhance analysis but it requires some learning on the user's part and there are some issues that need to be addressed by the developer.
Abstract: Vector NTI is a well-balanced, integrated desktop application for molecular sequence analysis and biological data management. It has a centralised database and five application modules: Vector NTI, AlignX, BioAnnotator, ContigExpress and GenomBench. In this review, the features and functions available in this software are examined. These include database management, primer design, virtual cloning, alignments, sequence assembly, a 3D molecular viewer and internet tools. Some problems encountered when using this software are also discussed. It is hoped that this review will introduce this software to more molecular biologists so they can make better-informed decisions when choosing computational tools to facilitate their everyday laboratory work. This tool can save time and enhance analysis, but it requires some learning on the user's part and there are some issues that need to be addressed by the developer.

340 citations


Journal ArticleDOI
TL;DR: A novel algorithm for comparative genome assembly that can accurately assemble a typical bacterial genome in less than four minutes on a standard desktop computer is described.
Abstract: One of the most complex and computationally intensive tasks of genome sequence analysis is genome assembly. Even today, few centres have the resources, in both software and hardware, to assemble a genome from the thousands or millions of individual sequences generated in a whole-genome shotgun sequencing project. With the rapid growth in the number of sequenced genomes has come an increase in the number of organisms for which two or more closely related species have been sequenced. This has created the possibility of building a comparative genome assembly algorithm, which can assemble a newly sequenced genome by mapping it onto a reference genome. We describe here a novel algorithm for comparative genome assembly that can accurately assemble a typical bacterial genome in less than four minutes on a standard desktop computer. The software is available as part of the open-source AMOS project.
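The central idea, anchoring each read at its best-matching position on a close reference instead of computing all-against-all overlaps, can be sketched in a few lines. The code below is a toy illustration of that idea under strong simplifying assumptions (short, indel-free reads), not the AMOS algorithm:

```python
# Toy sketch of comparative assembly: anchor each read at its
# best-matching offset on a reference genome, then call a per-column
# majority consensus.
from collections import Counter

def best_offset(read: str, reference: str) -> int:
    """Exhaustively find the offset with the fewest mismatches."""
    best, best_mm = 0, len(read) + 1
    for i in range(len(reference) - len(read) + 1):
        mm = sum(r != g for r, g in zip(read, reference[i:i + len(read)]))
        if mm < best_mm:
            best, best_mm = i, mm
    return best

def comparative_assemble(reads: list[str], reference: str) -> str:
    """Anchor reads on the reference, then take a per-column majority vote."""
    columns = [Counter() for _ in reference]
    for read in reads:
        offset = best_offset(read, reference)
        for j, base in enumerate(read):
            columns[offset + j][base] += 1
    # Fall back to the reference base where no read covers a column.
    return "".join(col.most_common(1)[0][0] if col else ref_base
                   for col, ref_base in zip(columns, reference))

reference = "ACGTACGTACGTAC"
reads = ["ACGTACG", "CGTACGTA", "GTACGTAC"]
print(comparative_assemble(reads, reference))  # reconstructs the reference
```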

259 citations


Journal ArticleDOI
TL;DR: The principles of SVM and the applications of SVMs to the analysis of biological data, mainly protein and DNA sequences are discussed.
Abstract: One of the major tasks in bioinformatics is the classification and prediction of biological data. With the rapid increase in size of the biological databanks, it is essential to use computer programs to automate the classification process. At present, the computer programs that give the best prediction performance are support vector machines (SVMs). This is because SVMs are designed to maximise the margin separating two classes, so that the trained model generalises well on unseen data. Most other computer programs implement a classifier through the minimisation of the error incurred in training, which leads to poorer generalisation. Because of this, SVMs have been widely applied to many areas of bioinformatics including protein function prediction, protease functional site recognition, transcription initiation site prediction and gene expression data classification. This paper will discuss the principles of SVMs and the applications of SVMs to the analysis of biological data, mainly protein and DNA sequences.
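As a concrete illustration of the kind of classifier the paper surveys, here is a minimal sequence-classification sketch assuming scikit-learn is available; the sequences, labels and 3-mer featurisation are toy choices, not taken from the paper:

```python
# Minimal SVM sequence classifier using 3-mer (trinucleotide) counts as
# features. The sequences and labels below are toy examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

seqs = ["ATGCGCGCTA", "ATGCGCGGTA", "TTTAAATTTA", "TTAAAATTAA"]
labels = [1, 1, 0, 0]  # e.g. coding vs non-coding (toy labels)

# Represent each sequence by its 3-mer counts.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(seqs)

# A linear-kernel SVM maximises the margin between the two classes,
# which is the generalisation argument made in the abstract above.
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(vectorizer.transform(["ATGCGCGCGA", "TTTAAAATTA"])))
```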

187 citations


Journal ArticleDOI
TL;DR: This review aims to explain basic LD measures and their properties; marker type also matters, with microsatellites having greater power than SNPs to detect LD in population isolates.
Abstract: Assessing the patterns of linkage disequilibrium (LD) has become an important issue in both evolutionary biology and medical genetics with the rapid accumulation of densely spaced DNA sequence variation data in several organisms. LD deals with the correlation of genetic variation at two or more loci or sites in the genome within a given population. LD measures range from traditional pairwise measures such as D' and r² to entropy-based multi-locus measures and haplotype-specific approaches. Understanding the evolutionary forces (in particular recombination) that generate the observed variation of LD patterns across genomic regions is addressed by model-based LD analysis. Marker type and its allelic composition also influence the observed LD pattern, microsatellites having a greater power to detect LD in population isolates than SNPs. This review aims to explain basic LD measures and their application properties.
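For the pairwise measures named above, here is a short sketch of the standard formulas for D, D' and r² at two biallelic loci; the haplotype and allele frequencies in the example are illustrative:

```python
# Standard pairwise LD measures for two biallelic loci (alleles A/a, B/b).
def ld_measures(p_AB: float, p_A: float, p_B: float):
    """Return (D, D', r^2) given the AB haplotype and A, B allele frequencies."""
    D = p_AB - p_A * p_B
    # D' normalises |D| by its maximum possible magnitude given the allele
    # frequencies, so it ranges from 0 to 1.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(D) / d_max if d_max else 0.0
    # r^2 is the squared correlation between the allelic states.
    r2 = D ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return D, d_prime, r2

print(ld_measures(p_AB=0.4, p_A=0.5, p_B=0.6))  # D=0.1, D'=0.5, r^2~0.167
```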

179 citations


Journal ArticleDOI
TL;DR: While Sequencher impresses with a very user-friendly interface and easy-to-use tools, BioEdit offers the largest and most customisable variety of tools.
Abstract: Programs to import, manage and align sequences and to analyse the properties of DNA, RNA and proteins are essential for every biological laboratory. This review describes two freeware programs (BioEdit and pDRAW for MS Windows) and a commercial program (Sequencher for MS Windows and Apple MacOS). BioEdit and Sequencher offer functions such as sequence alignment and editing plus reading of sequence trace files. pDRAW is a very convenient visualisation tool with a variety of analysis functions. While Sequencher impresses with a very user-friendly interface and easy-to-use tools, BioEdit offers the largest and most customisable variety of tools. The strength of pDRAW is the drawing and analysis of single sequences for priming and restriction sites, and virtual cloning. It has a database function for user-specific oligonucleotides and restriction enzymes.

165 citations


Journal ArticleDOI
TL;DR: The syntax, operations, infrastructure compatibility considerations, use cases and potential future applications of LSID and LSRS are described; their adoption is seen as an important step toward simpler, more elegant and more reliable integration of the world's biological knowledgebases, and as facilitating stronger global collaboration in biology.
Abstract: The World-Wide Web provides a globally distributed communication framework that is essential for almost all scientific collaboration, including bioinformatics. However, several limits and inadequacies have become apparent, one of which is the inability to programmatically identify locally named objects that may be widely distributed over the network. This shortcoming limits our ability to integrate multiple knowledgebases, each of which gives partial information of a shared domain, as is commonly seen in bioinformatics. The Life Science Identifier (LSID) and LSID Resolution System (LSRS) provide simple and elegant solutions to this problem, based on the extension of existing internet technologies. LSID and LSRS are consistent with next-generation semantic web and semantic grid approaches. This article describes the syntax, operations, infrastructure compatibility considerations, use cases and potential future applications of LSID and LSRS. We see the adoption of these methods as important steps toward simpler, more elegant and more reliable integration of the world’s biological knowledgebases, and as facilitating stronger global collaboration in biology.
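The LSID syntax described in the article is a URN of the form urn:lsid:authority:namespace:object, with an optional trailing revision. A small parsing sketch follows; the example identifier is illustrative:

```python
# Parse the LSID URN syntax:
#   urn:lsid:<authority>:<namespace>:<object>[:<revision>]
import re

LSID_RE = re.compile(
    r"^urn:lsid:(?P<authority>[^:]+):(?P<namespace>[^:]+)"
    r":(?P<object>[^:]+)(?::(?P<revision>[^:]+))?$",
    re.IGNORECASE,
)

def parse_lsid(lsid: str) -> dict:
    """Split an LSID into its named components, or raise on bad input."""
    match = LSID_RE.match(lsid)
    if not match:
        raise ValueError(f"not a valid LSID: {lsid!r}")
    return match.groupdict()

print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434"))
# {'authority': 'ncbi.nlm.nih.gov', 'namespace': 'pubmed',
#  'object': '12571434', 'revision': None}
```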

158 citations


Journal ArticleDOI
TL;DR: A survey of existing methods proposed for the identification of transcription factor binding sites in the regulatory regions of co-expressed genes, focusing both on the ideas underlying them and their availability to the scientific community is provided.
Abstract: Understanding the complex mechanisms governing basic biological processes requires the characterisation of regulatory motifs modulating gene expression at the transcriptional and post-transcriptional levels. In particular, the extent, chronology and cell-specificity of transcription are modulated by the interaction of transcription factors with their corresponding binding sites, mostly located near (or sometimes quite far away from) the transcription start site of the gene. The constantly growing amount of genomic data, complemented by other sources of information such as expression data derived from microarray experiments, has opened new opportunities to researchers in this field. Many different methods have been proposed for the identification of transcription factor binding sites in the regulatory regions of co-expressed genes; unfortunately, this is a very challenging problem from both the computational and the biological viewpoint. This paper provides a survey of existing methods proposed for the problem, focusing both on the ideas underlying them and on their availability to the scientific community.
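As a concrete example of the simplest family of methods such surveys cover, here is a sketch of scanning a sequence with a position weight matrix (PWM); the motif counts below are made up for illustration and do not correspond to any real transcription factor:

```python
# Scan a sequence with a log-odds position weight matrix (PWM).
import math

BASES = "ACGT"
# Toy per-position base counts for a 4-column motif (not a real factor).
counts = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]

# Log-odds weights against a uniform background, with +1 pseudocounts.
pwm = [{b: math.log(((col[b] + 1) / (sum(col.values()) + 4)) / 0.25)
        for b in BASES} for col in counts]

def scan(sequence: str):
    """Yield (position, window, score) for every window of motif width."""
    width = len(pwm)
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        yield i, window, sum(pwm[j][b] for j, b in enumerate(window))

best = max(scan("TTACGTAGCTTT"), key=lambda hit: hit[2])
print(best)  # the top hit should be the 'ACGT'-like window
```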

105 citations


Journal ArticleDOI
TL;DR: Since its inception ten years ago, the Saccharomyces Genome Database has seen a dramatic increase in its usage, has developed and maintained a positive working relationship with the yeast research community, and has served as a template for at least one other database.
Abstract: A scientific database can be a powerful tool for biologists in an era where large-scale genomic analysis, combined with smaller-scale scientific results, provides new insights into the roles of genes and their products in the cell. However, the collection and assimilation of data is, in itself, not enough to make a database useful. The data must be incorporated into the database and presented to the user in an intuitive and biologically significant manner. Most importantly, this presentation must be driven by the user's point of view; that is, from a biological perspective. The success of a scientific database can therefore be measured by the response of its users - statistically, by usage numbers and, in a less quantifiable way, by its relationship with the community it serves and its ability to serve as a model for similar projects. Since its inception ten years ago, the Saccharomyces Genome Database (SGD) has seen a dramatic increase in its usage, has developed and maintained a positive working relationship with the yeast research community, and has served as a template for at least one other database. The success of SGD, as measured by these criteria, is due in large part to philosophies that have guided its mission and organisation since it was established in 1993. This paper aims to detail these philosophies and how they shape the organisation and presentation of the database.

101 citations


Journal ArticleDOI
TL;DR: The aim is to create a cell-by-cell catalogue of glycosyltransferase expression and detected glycan structures; the review argues that the emerging field should accept that unrestricted dissemination of scientific data accelerates scientific findings.
Abstract: The term ‘glycomics’ describes the scientific attempt to identify and study all the glycan molecules – the glycome – synthesised by an organism. The aim is to create a cell-by-cell catalogue of glycosyltransferase expression and detected glycan structures. The current status of databases and bioinformatics tools, which are still in their infancy, is reviewed. The structures of glycans, as secondary gene products, cannot be easily predicted from the DNA sequence. Glycan sequences cannot be described by a simple linear one-letter code, as each pair of monosaccharides can be linked in several ways and branched structures can be formed. Few of the bioinformatics algorithms developed for genomics/proteomics can be directly adapted for glycomics. The development of algorithms that allow rapid, automatic interpretation of mass spectra to identify glycan structures is currently the most active field of research. The lack of generally accepted ways to normalise glycan structures and exchange glycan formats hampers efficient cross-linking and the automatic exchange of distributed data. The emerging field of glycomics should accept that unrestricted dissemination of scientific data accelerates scientific findings, and should launch new initiatives to explore the data.
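The branching problem mentioned above is easy to make concrete: a glycan is naturally a tree of monosaccharides with labelled linkages, not a string. A minimal sketch of such a representation follows; the residue names and linkages are illustrative only:

```python
# Represent a branched glycan as a tree of residues with labelled linkages.
from dataclasses import dataclass, field

@dataclass
class Glycan:
    residue: str                      # e.g. 'GlcNAc', 'Man', 'Gal'
    linkage: str = ""                 # linkage to the parent, e.g. 'b1-4'
    branches: list["Glycan"] = field(default_factory=list)

    def __str__(self) -> str:
        inner = ",".join(str(b) for b in self.branches)
        label = f"{self.residue}({self.linkage})" if self.linkage else self.residue
        return f"{label}[{inner}]" if inner else label

# A branched core: one mannose carrying two differently linked mannoses.
core = Glycan("GlcNAc", branches=[
    Glycan("Man", "b1-4", branches=[
        Glycan("Man", "a1-3"),
        Glycan("Man", "a1-6"),
    ]),
])
print(core)  # GlcNAc[Man(b1-4)[Man(a1-3),Man(a1-6)]]
```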

Journal ArticleDOI
TL;DR: Various computational approaches for identification of conserved gene strings and construction of local alignments of gene orders in prokaryotic genomes are discussed.
Abstract: Gene order in prokaryotes is conserved to a much lesser extent than protein sequences. Only some operons, primarily those that encode physically interacting proteins, are conserved in all or most of the bacterial and archaeal genomes. Nevertheless, even the limited conservation of operon organisation that is observed provides valuable evolutionary and functional clues through multiple genome comparisons. With the rapid growth in the number and diversity of sequenced prokaryotic genomes, functional inferences for uncharacterised genes located in the same conserved gene neighborhood with well-studied genes are becoming increasingly important. In this review, we discuss various computational approaches for identification of conserved gene strings and construction of local alignments of gene orders in prokaryotic genomes.
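The simplest version of this computation, finding gene pairs that remain adjacent across all genomes, can be sketched directly; the genomes and gene names below are toy examples, not a published method's implementation:

```python
# Find gene pairs that are adjacent (in either order) in every genome.
# Genomes are given as ordered lists of gene-family identifiers.
def conserved_pairs(genomes: list[list[str]]) -> set[frozenset]:
    """Return unordered gene pairs adjacent in all genomes."""
    def adjacent(genome):
        return {frozenset(pair) for pair in zip(genome, genome[1:])}
    common = adjacent(genomes[0])
    for genome in genomes[1:]:
        common &= adjacent(genome)
    return common

# Three toy genomes: the trpA-trpB neighbourhood is preserved while the
# rest of the gene order is shuffled (gene names are illustrative).
genomes = [
    ["trpA", "trpB", "hisC", "rpoB", "gyrA"],
    ["gyrA", "trpB", "trpA", "rpoB", "hisC"],
    ["rpoB", "gyrA", "hisC", "trpA", "trpB"],
]
print(conserved_pairs(genomes))  # {frozenset({'trpA', 'trpB'})}
```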

Journal ArticleDOI
Don Gilbert
TL;DR: This review looks at internet archives, repositories and lists for obtaining popular and useful biology and bioinformatics software.
Abstract: This review looks at internet archives, repositories and lists for obtaining popular and useful biology and bioinformatics software. Resources include collections of free software, services for the collaborative development of new programs, software news media and catalogues of links to bioinformatics software and web tools. Problems with such resources arise from needs for continued curator effort to collect and update these, combined with less than optimal community support, funding and collaboration. Despite some problems, the available software repositories provide needed public access to many tools that are a foundation for analyses in bioscience research efforts.

Journal ArticleDOI
TL;DR: The need for a more formal handling of biological information processing with stochastic and mobile process algebras is addressed and new computational models inspired by nature are obtained.
Abstract: The need for a more formal handling of biological information processing with stochastic and mobile process algebras is addressed. Biology can benefit from this approach, yielding a better understanding of the behavioural properties of cells, and computer science can benefit from it too, obtaining new computational models inspired by nature.

Journal ArticleDOI
TL;DR: It is shown that the Bayesian method implemented in GeneMark can be augmented and reintroduced as a rigorous forward-backward (FB) algorithm for local posterior decoding described in the HMM theory.
Abstract: In this paper, we review developments in probabilistic methods of gene recognition in prokaryotic genomes, with an emphasis on connections to the general theory of hidden Markov models (HMM). We show that the Bayesian method implemented in GeneMark, a frequently used gene-finding tool, can be augmented and reintroduced as a rigorous forward-backward (FB) algorithm for local posterior decoding described in HMM theory. Another earlier developed method, prokaryotic GeneMark.hmm, uses a modification of the Viterbi algorithm for an HMM with duration to identify the most likely global path through hidden functional states given the DNA sequence. The GeneMark and GeneMark.hmm programs are worth using in concert for analysing prokaryotic DNA sequences that arguably do not follow any exact mathematical model. The new extension of GeneMark using the FB algorithm was implemented in the software program GeneMark.fba. Given the DNA sequence, this program determines an a posteriori probability for each nucleotide to belong to a coding or non-coding region. Also, for any open reading frame (ORF), it assigns a score defined as a probabilistic measure of all paths through hidden states that traverse the ORF as a coding region. The prediction accuracy of GeneMark.fba, as determined in our tests, compared favourably to the accuracy of the initial (standard) GeneMark program. Comparison to the prokaryotic GeneMark.hmm also demonstrated a certain, yet species-specific, degree of improvement in raw gene detection, ie detection of the correct reading frame (and stop codon). The accuracy of exact gene prediction, which concerns precise prediction of the gene start (which in a prokaryotic genome unambiguously defines the reading frame and stop codon and, thus, the whole protein product), remains higher in GeneMarkS, which uses a more elaborate HMM to address this task specifically.
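To make the posterior-decoding idea concrete, here is a minimal forward-backward sketch for a two-state (coding/non-coding) HMM, assuming NumPy; the transition and emission probabilities are illustrative, and this is not GeneMark.fba itself:

```python
# Forward-backward posterior decoding for a two-state HMM:
# P(state at position i | whole sequence). Probabilities are toy values.
import numpy as np

states = ["noncoding", "coding"]
trans = np.array([[0.9, 0.1],            # state transition probabilities
                  [0.1, 0.9]])
emit = {"noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "coding":    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}
start = np.array([0.5, 0.5])

def posterior(seq: str) -> np.ndarray:
    """Return a (len(seq) x 2) matrix of per-position state posteriors."""
    n, k = len(seq), len(states)
    e = np.array([[emit[s][c] for s in states] for c in seq])  # n x k
    fwd = np.zeros((n, k)); bwd = np.zeros((n, k))
    fwd[0] = start * e[0]
    for i in range(1, n):                 # forward recursion
        fwd[i] = (fwd[i - 1] @ trans) * e[i]
    bwd[-1] = 1.0
    for i in range(n - 2, -1, -1):        # backward recursion
        bwd[i] = trans @ (e[i + 1] * bwd[i + 1])
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)

print(posterior("ACGGGCGCAT").round(3))
```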

Journal ArticleDOI
TL;DR: This paper reviews the computational methods currently available to analyse host-parasite relationships, from a posteriori approaches such as Brooks' Parsimony Analysis to a priori, model-based methods such as the reconciled trees implemented in the program TreeMap.
Abstract: Computational aspects of host-parasite phylogenies form part of a set of general associations between areas and organisms, hosts and parasites, and species and genes. The problem is not new and the commonalities of exploring vicariance biogeography (organisms tracking areas) and host-parasite co-speciation (parasites tracking hosts) have been recognised for some time. Methods for comparing host-parasite phylogenies are now well established and fall within two basic categories defined in terms of the way the data are interpreted in relation to the comparison of host-parasite phylogenies, so-called a posteriori, eg Brooks' Parsimony Analysis (BPA), or a priori, eg reconciled trees and other model-based methods, as implemented in the program TreeMap; the relative merits of the two philosophies inherent in these two approaches remain hotly debated. This paper reviews the computational methods currently available to analyse host-parasite relationships.

Journal ArticleDOI
TL;DR: A review of An Essential Guide to the Basic Local Alignment Search Tool: BLAST by Ian Korf, Mark Yandell and Joseph Bedell.
Abstract: Review of An Essential Guide to the Basic Local Alignment Search Tool: BLAST by Ian Korf, Mark Yandell and Joseph Bedell

Journal ArticleDOI
TL;DR: Briefings in Bioinformatics has an opportunity to provide a forum for the maintainers of successful database projects to publish analyses of their approach, and to reveal the general lessons they have learned about the task of creating a successful biological database.
Abstract: In this issue, we are happy to present a collection of four articles that discuss the key factors in building a successful biological database. The bioinformatics community does many things, but we can roughly summarise most activities as either building algorithms or building databases. Successful algorithms are measured in terms of their time and space complexity, and there are well-known standards for communicating and validating new algorithms. Databases are much more difficult to validate. How do we know that a database is good? Do we judge it by the quality or quantity of its data, its usability and its ability to integrate with outside resources? Should we also consider the technical architecture and the user support mechanisms and documentation? What about the impact on science, the number of web hits, and total gigabytes of data that a database transfers? Of course, we probably should include all these factors in deciding if a database is a success. Anyone who builds and maintains biological databases also knows that it is very hard to publish traditional academic papers about databases. Familiar refrains include 'What's the hypothesis?', 'What's the evaluation scheme?', and 'This is not science, this is engineering'. In partial response to this difficulty, there is the annual special 'database issue' of Nucleic Acids Research, and the occasional Application Note in the back of the Bioinformatics journal, and other similar publications. But these articles are always severely limited in page length, and can sometimes read as database advertisements, written to attract potential users. We believe that Briefings in Bioinformatics has an opportunity to provide a forum for the maintainers of successful database projects to publish analyses of their approach, and to reveal the general lessons they have learned about the task of creating a successful biological database. Although many issues in creating a good database may transcend biology and be valid for all domains, there are special circumstances around biological databases that make them worth treating as a special group (eg the presumptions that they be freely available, that they be on the internet, that they keep up with a rapidly growing field and that they maintain high biological relevance). To assemble this special issue, we solicited manuscripts from the scientists associated with a shortlist of databases that the editors and their consultants felt could be called 'successful' with little controversy. These ranged over a variety of data types, but concentrated on databases …

Journal ArticleDOI
TL;DR: This paper reviews a variety of different graphical notations currently in active use for modelling dynamic processes in bioinformatics and biotechnology, and crystallises a set of properties essential to any proposal for a modelling language seeking to provide an adequate systemic description of biological processes.
Abstract: This paper reviews a variety of different graphical notations currently in active use for modelling dynamic processes in bioinformatics and biotechnology, and crystallises from these notations a set of properties essential to any proposal for a modelling language seeking to provide an adequate systemic description of biological processes.

Journal ArticleDOI
TL;DR: The two integrated annotation methods used by FANTOM are reviewed: one-by-one and categorised annotation; the latter is expected to become the most utilised method for integrating the genome and the transcriptome.
Abstract: The key to reliable annotation of a mammalian genome is broad characterisation of the transcriptional output, the transcriptome. FANTOM, the functional annotation of mouse cDNA, is a large-scale analysis of both the genome and the transcriptome of the mouse. In the early days of this work, the transcripts were characterised one by one using our sophisticated methods. After the timely release of the first draft of the mouse genome sequence, interesting information was obtained by integrating it with these one-by-one annotations. Moreover, each transcript was accompanied by its expression profile. Here, the two integrated annotation methods used by FANTOM are reviewed: one-by-one and categorised. One-by-one annotation refers to naming carried out based on well-known transcripts or their fragments, using the top-down-style pipeline developed mostly by the FANTOM project. Categorised annotation, which refers to transcript grouping, not only helps in the naming of unknown transcripts, but is expected to become the most utilised method for integrating the genome and the transcriptome.

Journal ArticleDOI
TL;DR: The different definitions of protein domains are clarified, the available public databases with domain boundary information are described, and existing domain boundary prediction methods are reviewed, with a discussion of their strengths and weaknesses.
Abstract: The delineation of domain boundaries of a given sequence in the absence of known 3D structures or detectable sequence homology to known domains benefits many areas in protein science, such as protein engineering, protein 3D structure determination and protein structure prediction. With the exponential growth of newly determined sequences, our ability to predict domain boundaries rapidly and accurately from sequence information alone is both essential and critical from the viewpoint of gene function annotation. Anyone attempting to predict domain boundaries for a single protein sequence is invariably confronted with a plethora of databases that contain boundary information available from the internet and a variety of methods for domain boundary prediction. How are these derived and how well do they work? What definition of 'domain' do they use? We will first clarify the different definitions of protein domains, and then describe the available public databases with domain boundary information. Finally, we will review existing domain boundary prediction methods and discuss their strengths and weaknesses.

Journal ArticleDOI
TL;DR: The Protein Data Bank is a widely used biological database of macromolecular structures with a long history that is treated as lessons learned and is used to highlight what are believed to be the best practices important to developers of biological databases today.
Abstract: The Protein Data Bank (PDB) is a widely used biological database of macromolecular structures with a long history. This history is treated as lessons learned and is used to highlight what are believed to be the best practices important to developers of biological databases today. While the focus is on data quality, data representation and the information technology to support these data, the non-data and technology issues cannot be ignored. The role of the human factor in the form of users, collaborators, scientific society and ad hoc committees is also included.

Journal ArticleDOI
TL;DR: The experience of building biological databases, which have most aspects in common with complex databases in other fields, has led the authors to emphasise simplicity and conservative technology choices.
Abstract: We present our experience of building biological databases. Such databases have most aspects in common with other complex databases in other fields. We do not believe that biological data are that different from complex data in other fields. Our experience has led us to emphasise simplicity and conservative technology choices when building these databases. This is a short paper of advice that we hope is useful to people designing their own biological database.

Journal ArticleDOI
TL;DR: The central challenges in the analysis of large data sets, and how they might be overcome, are discussed and a summary of other important methods from the literature is provided.
Abstract: Large heterogeneous expression data comprising a variety of cellular conditions hold the promise of a global view of transcriptional regulation. While standard analysis methods have been successfully applied to smaller data sets, large-scale data pose specific challenges that have prompted the development of new and more sophisticated approaches. This paper focuses on one such approach (the Signature Algorithm) and discusses the central challenges in the analysis of large data sets, and how they might be overcome. Biological questions that have been addressed using the Signature Algorithm are highlighted and a summary of other important methods from the literature is provided.
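A rough sketch of a single refinement step in the spirit of the Signature Algorithm follows, assuming NumPy; the thresholds, scoring and synthetic data are illustrative simplifications, not the authors' implementation:

```python
# One Signature-Algorithm-style refinement step: from a seed gene set,
# pick the conditions where those genes co-respond, then pick the genes
# that score highly over just those conditions.
import numpy as np

def signature_step(expr, seed_genes, cond_z=1.5, gene_z=1.5):
    """expr: genes x conditions matrix of normalised expression scores."""
    # Score each condition by the mean expression of the seed genes.
    cond_scores = expr[seed_genes].mean(axis=0)
    conds = np.where(np.abs(cond_scores) > cond_z * cond_scores.std())[0]
    # Score each gene over the selected conditions only (sign-weighted sum).
    gene_scores = expr[:, conds] @ np.sign(cond_scores[conds])
    genes = np.where(gene_scores > gene_z * gene_scores.std())[0]
    return genes, conds

rng = np.random.default_rng(0)
expr = rng.standard_normal((100, 50))
expr[:10, :5] += 2.0                     # implant a co-expressed module
genes, conds = signature_step(expr, seed_genes=[0, 1, 2])
print(genes, conds)                      # should roughly recover the module
```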

Journal ArticleDOI
TL;DR: This report aims to dissect each of these degrees to determine where the differences lie, to give prospective students an idea as to which degree suits their career goals, and to give an overview of the pedagogy of Australian bioinformatics education.
Abstract: Bioinformatics has been a hot topic in Australia's biotechnology circles for the past five years. As with biotechnology in the 1990s, there has been a sudden increase in the number of Bioinformatics undergraduate degrees. For students in the 2005 intake there are six undergraduate Bioinformatics degrees to choose from and another five Bioinformatics streams within a Bachelor of Science degree. The courses vary from three to four years of full-time study. This report aims to dissect each of these degrees to determine where the differences lie, to give prospective students an idea as to which degree suits their career goals, and to give an overview of the pedagogy of Australian bioinformatics education.


Journal ArticleDOI
TL;DR: Since the mouse is used as a model organism to study human genes and their disease associations, this review focuses on information extraction and collation that captures the functional context of repeats in mouse transcripts, to facilitate the biological interpretation and extrapolation of findings to humans.
Abstract: The back-to-back release of the mouse genome and the functionally annotated RIKEN mouse full-length cDNA collection was an important milestone in mammalian genomics. Yet much of the data remain to be explored in terms of biological effects and mechanisms. For example, interspersed repeats account for 39 per cent of the mouse genome sequence and 11 per cent of representative transcripts. A considerable number of transposable repeat elements are still active and propagating in mouse compared with human. While existing repeat databases and tools assist the classification of repeats or the identification of new repeats, there is little bioinformatic support towards exploring the extent and role of repeats in transcriptional variation, modulation of protein function, or gene regulatory events. Since the mouse is used as a model organism to study human genes and their disease associations, this review focuses on information extraction and collation that captures the functional context of repeats in mouse transcripts, to facilitate the biological interpretation and extrapolation of findings to humans.