
Showing papers in "Briefings in Bioinformatics in 2002"


Journal ArticleDOI
TL;DR: The close relationship of PROSITE with the SWISS-PROT protein database allows the evaluation of the sensitivity and specificity of the PROSITE motifs and their periodic reviewing.
Abstract: Among the various databases dedicated to the identification of protein families and domains, PROSITE is the first one created and has continuously evolved since. PROSITE currently consists of a large collection of biologically meaningful motifs that are described as patterns or profiles, and linked to documentation briefly describing the protein family or domain they are designed to detect. The close relationship of PROSITE with the SWISS-PROT protein database allows the evaluation of the sensitivity and specificity of the PROSITE motifs and their periodic reviewing. In return, PROSITE is used to help annotate SWISS-PROT entries. The main characteristics and the techniques of family and domain identification used by PROSITE are reviewed in this paper.

897 citations
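
PROSITE patterns are essentially regular expressions over amino acids. As a rough illustration of how such a pattern can be scanned against a sequence, here is a minimal Python sketch; the pattern-to-regex conversion covers only the common syntax elements, and the example pattern and sequence are illustrative rather than taken from the database.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simplified PROSITE-style pattern into a Python regex.

    Handles the common elements only: x (any residue), [..] (allowed
    residues), {..} (forbidden residues) and (n) / (n,m) repetition.
    Anchors (< and >) and other refinements are ignored in this sketch.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        rep = ""
        if "(" in element:                       # e.g. x(2) or x(2,4)
            element, count = element.split("(")
            rep = "{" + count.rstrip(")") + "}"
        if element == "x":
            parts.append("." + rep)
        elif element.startswith("["):            # allowed residues
            parts.append(element + rep)
        elif element.startswith("{"):            # forbidden residues
            parts.append("[^" + element[1:-1] + "]" + rep)
        else:                                    # literal residue(s)
            parts.append(element + rep)
    return "".join(parts)

# Illustrative pattern (in the spirit of a protein kinase ATP-binding motif)
# and an invented sequence; neither is taken verbatim from PROSITE.
pattern = "[LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-[GSTACLIVMFY]"
sequence = "MQLVGSGAFGTVYKGLWIPEGEK"
match = re.search(prosite_to_regex(pattern), sequence)
print(match.group(0) if match else "no match")
```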


Journal ArticleDOI
TL;DR: The classical twin study is the most popular design in behavioural genetics and the flexibility of Mx allows the modelling of multivariate data to examine the genetic and environmental relations between two or more phenotypes and the modelling of categorical traits under liability-threshold models.
Abstract: The classical twin study is the most popular design in behavioural genetics. It has strong roots in biometrical genetic theory, which allows predictions to be made about the correlations between observed traits of identical and fraternal twins in terms of underlying genetic and environmental components. One can infer the relative importance of these 'latent' factors (model parameters) by structural equation modelling (SEM) of observed covariances of both twin types. SEM programs estimate model parameters by minimising a goodness-of-fit function between observed and predicted covariance matrices, usually by the maximum-likelihood criterion. Likelihood ratio statistics also allow the comparison of fit of different competing models. The program Mx, specifically developed to model genetically sensitive data, is now widely used in twin analyses. The flexibility of Mx allows the modelling of multivariate data to examine the genetic and environmental relations between two or more phenotypes and the modelling of categorical traits under liability-threshold models.

686 citations
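
Mx is a full structural equation modelling package; the biometrical expectations it fits can nonetheless be illustrated with a minimal sketch. Under the ACE model the expected twin covariance is a² + c² for identical pairs and 0.5a² + c² for fraternal pairs, with total variance a² + c² + e². The Python sketch below minimises a maximum-likelihood discrepancy over invented covariance matrices; it is not Mx, only the underlying idea.

```python
import numpy as np
from scipy.optimize import minimize

def expected_cov(a, c, e, r_genetic):
    """Expected 2x2 twin covariance under the ACE model.
    r_genetic is 1.0 for identical (MZ) and 0.5 for fraternal (DZ) pairs."""
    var = a**2 + c**2 + e**2
    cov = r_genetic * a**2 + c**2
    return np.array([[var, cov], [cov, var]])

def ml_discrepancy(params, S_mz, S_dz, n_mz, n_dz):
    """Maximum-likelihood fit function summed over the MZ and DZ groups."""
    a, c, e = params
    total = 0.0
    for S, n, r in ((S_mz, n_mz, 1.0), (S_dz, n_dz, 0.5)):
        Sigma = expected_cov(a, c, e, r)
        total += n * (np.log(np.linalg.det(Sigma)) - np.log(np.linalg.det(S))
                      + np.trace(S @ np.linalg.inv(Sigma)) - 2)
    return total

# Invented observed covariance matrices for a standardised trait
S_mz = np.array([[1.0, 0.70], [0.70, 1.0]])   # 300 identical twin pairs
S_dz = np.array([[1.0, 0.45], [0.45, 1.0]])   # 300 fraternal twin pairs

res = minimize(ml_discrepancy, x0=[0.5, 0.5, 0.5],
               args=(S_mz, S_dz, 300, 300), bounds=[(0.01, 2.0)] * 3)
a, c, e = res.x
total_var = a**2 + c**2 + e**2
print(f"a2={a**2/total_var:.2f}  c2={c**2/total_var:.2f}  e2={e**2/total_var:.2f}")
```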


Journal ArticleDOI
Cathryn M. Lewis
TL;DR: The transmission disequilibrium test is provided as an alternative family-based test, which is robust to population stratification, and the relative benefits of each analysis are summarised.
Abstract: This paper provides a review of the design and analysis of genetic association studies. In case control studies, the different contingency tables and their relationships to the underlying genetic model are defined. Population stratification is discussed, with suggested methods to identify and correct for the effect. The transmission disequilibrium test is provided as an alternative family-based test, which is robust to population stratification. The relative benefits of each analysis are summarised.

452 citations
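
The transmission disequilibrium test itself reduces to a simple count-based statistic: with b transmissions and c non-transmissions of the candidate allele from heterozygous parents, (b − c)²/(b + c) is compared against a chi-square distribution with one degree of freedom. A minimal sketch with invented counts:

```python
from scipy.stats import chi2

def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test for one biallelic marker.

    transmitted / untransmitted count how often heterozygous parents
    passed (or did not pass) the candidate allele to an affected child.
    Returns the McNemar-style chi-square statistic and its p-value (1 df).
    """
    b, c = transmitted, untransmitted
    stat = (b - c) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

# Invented counts from 100 heterozygous parents of affected children
stat, p = tdt(transmitted=62, untransmitted=38)
print(f"TDT chi-square = {stat:.2f}, p = {p:.4f}")
```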


Journal ArticleDOI
TL;DR: Native BioMOBY objects are lightweight XML, and make up both the query and the response of a Simple Object Access Protocol (SOAP) transaction.
Abstract: BioMOBY is an Open Source research project which aims to generate an architecture for the discovery and distribution of biological data through web services; data and services are decentralised, but the availability of these resources, and the instructions for interacting with them, are registered in a central location called MOBY Central. BioMOBY adds to the web services paradigm, as exemplified by Universal Description, Discovery and Integration (UDDI), by having an object-driven registry query system with object and service ontologies. This allows users to traverse expansive and disparate data sets where each possible next step is presented based on the data object currently in-hand. Moreover, a path from the current data object to a desired final data object could be automatically discovered using the registry. Native BioMOBY objects are lightweight XML, and make up both the query and the response of a Simple Object Access Protocol (SOAP) transaction.

369 citations


Journal ArticleDOI
TL;DR: The ProDom database is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases that makes it particularly useful to help sustain the growth of InterPro.
Abstract: The ProDom database is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases. An associated database, ProDom-CG, has been derived as a restriction of ProDom to completely sequenced genomes. The ProDom construction method is based on iterative PSI-BLAST searches and multiple alignments are generated for each domain family. The ProDom web server provides the user with a set of tools to visualise multiple alignments, phylogenetic trees and domain architectures of proteins, as well as a BLAST-based server to analyse new sequences for homologous domains. The comprehensive nature of ProDom makes it particularly useful to help sustain the growth of InterPro.

360 citations


Journal ArticleDOI
TL;DR: InterPro was developed as an integrated documentation resource for protein families, domains and functional sites, to rationalise the complementary efforts of the individual protein signature database projects.
Abstract: The exponential increase in the submission of nucleotide sequences to the nucleotide sequence database by genome sequencing centres has resulted in a need for rapid, automatic methods for classification of the resulting protein sequences. There are several signature and sequence cluster-based methods for protein classification, each resource having distinct areas of optimum application owing to the differences in the underlying analysis methods. In recognition of this, InterPro was developed as an integrated documentation resource for protein families, domains and functional sites, to rationalise the complementary efforts of the individual protein signature database projects. The member databases - PRINTS, PROSITE, Pfam, ProDom, SMART and TIGRFAMs - form the InterPro core. Related signatures from each member database are unified into single InterPro entries. Each InterPro entry includes a unique accession number, functional descriptions and literature references, and links are made back to the relevant member database(s). Release 4.0 of InterPro (November 2001) contains 4,691 entries, representing 3,532 families, 1,068 domains, 74 repeats and 15 sites of post-translational modification (PTMs) encoded by different regular expressions, profiles, fingerprints and hidden Markov models (HMMs). Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (2,141,621 InterPro hits from 586,124 SWISS-PROT and TrEMBL protein sequences). The database is freely accessible for text- and sequence-based searches.

344 citations


Journal ArticleDOI
TL;DR: Ongoing developments include the further improvement of functional and automatic annotation in the databases including evidence attribution with particular emphasis on the human, archaeal and bacterial proteomes and the provision of additional resources such as the International Protein Index (IPI) and XML format of SWISS-PROT and TrEMBL to the user community.
Abstract: SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. Together with its automatically annotated supplement TrEMBL, it provides a comprehensive and high-quality view of the current state of knowledge about proteins. Ongoing developments include the further improvement of functional and automatic annotation in the databases including evidence attribution with particular emphasis on the human, archaeal and bacterial proteomes and the provision of additional resources such as the International Protein Index (IPI) and XML format of SWISS-PROT and TrEMBL to the user community.

262 citations


Journal ArticleDOI
TL;DR: Algorithmic considerations are presented for a new approach to haplotype determination: inferring haplotypes from localised polymorphism data gathered from short genome 'fragments'.
Abstract: With the consensus human genome sequenced and many other sequencing projects at varying stages of completion, greater attention is being paid to the genetic differences among individuals and the abilities of those differences to predict phenotypes. A significant obstacle to such work is the difficulty and expense of determining haplotypes - sets of variants genetically linked because of their proximity on the genome - for large numbers of individuals for use in association studies. This paper presents some algorithmic considerations in a new approach for haplotype determination: inferring haplotypes from localised polymorphism data gathered from short genome 'fragments.' Formalised models of the biological system under consideration are examined, given a variety of assumptions about the goal of the problem and the character of optimal solutions. Some theoretical results and algorithms for handling haplotype assembly given the different models are then sketched. The primary conclusion is that some important simplified variants of the problem yield tractable problems while more general variants tend to be intractable in the worst case.

226 citations
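
As a toy illustration of the fragment-based haplotype assembly problem, the sketch below greedily assigns partial SNP fragments to one of two haplotype classes and takes a majority vote per site. It is a naive heuristic for intuition only, not one of the algorithms analysed in the paper.

```python
def assemble_haplotypes(fragments, n_snps):
    """Greedily assign SNP fragments to two haplotype classes.

    Each fragment is a dict {snp_index: allele} with alleles coded 0/1.
    This is only a toy heuristic: each fragment joins the class it
    disagrees with least, and a majority vote per site gives the final
    haplotypes ('-' where no fragment covers a site).
    """
    votes = [[[0, 0] for _ in range(n_snps)] for _ in range(2)]

    def mismatches(fragment, cls):
        m = 0
        for snp, allele in fragment.items():
            counts = votes[cls][snp]
            if sum(counts) > 0 and counts[allele] < counts[1 - allele]:
                m += 1
        return m

    for fragment in fragments:
        cls = 0 if mismatches(fragment, 0) <= mismatches(fragment, 1) else 1
        for snp, allele in fragment.items():
            votes[cls][snp][allele] += 1

    haplotypes = []
    for cls in range(2):
        hap = "".join(("0" if v[0] >= v[1] else "1") if sum(v) else "-"
                      for v in votes[cls])
        haplotypes.append(hap)
    return haplotypes

# Invented fragments covering five SNP sites
fragments = [{0: 0, 1: 1, 2: 1}, {1: 1, 2: 1, 3: 0}, {2: 0, 3: 1, 4: 1},
             {0: 1, 1: 0}, {3: 1, 4: 1}, {2: 1, 3: 0, 4: 0}]
print(assemble_haplotypes(fragments, 5))   # -> ['01100', '10011']
```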


Journal ArticleDOI
Sue A. Olson

167 citations


Journal ArticleDOI
TL;DR: The concepts behind QSAR are introduced, problems that may be encountered are pointed out, ways of avoiding the pitfalls are suggested, and several exciting new QSAR methods discovered during the last decade are described.
Abstract: Empirical methods for building predictive models of the relationships between molecular structure and useful properties are becoming increasingly important. This has arisen because drug discovery and development have become more complex. A large amount of biological target information is becoming available through molecular biology. Automation of chemical synthesis and pharmacological screening has also provided a vast amount of experimental data. Tools for designing libraries and extracting information from molecular databases and high-throughput screening experiments robustly and quickly enable leads to be discovered more effectively. As drug leads progress down the development pipeline, the ability to predict physicochemical, pharmacokinetic and toxicological properties of these leads is becoming increasingly important in reducing the number of expensive, late development failures. Quantitative structure-activity relationship (QSAR) methods have much to offer in these areas. However, QSAR analysis has many traps for unwary practitioners. This review introduces the concepts behind QSAR, points out problems that may be encountered, suggests ways of avoiding the pitfalls and introduces several exciting, new QSAR methods discovered during the last decade.

127 citations
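
A classical Hansch-type QSAR model is simply a regression of biological activity on molecular descriptors. The sketch below fits such a model by ordinary least squares on an invented training set; a real study would also validate with cross-validation (q²) and an external test set.

```python
import numpy as np

# Invented training set: each row holds molecular descriptors
# (e.g. logP, molecular weight / 100, hydrogen-bond donors) and the
# measured activity (pIC50). All values are made up for illustration.
X = np.array([[1.2, 2.3, 1], [2.8, 3.1, 0], [0.5, 1.8, 2],
              [3.4, 4.0, 1], [2.1, 2.9, 1], [1.7, 2.2, 3]])
y = np.array([5.1, 6.3, 4.2, 6.9, 5.8, 4.6])

# Classical QSAR: ordinary least-squares regression of activity on descriptors
A = np.column_stack([np.ones(len(X)), X])          # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef

# r^2 on the training set -- a real study would also report cross-validated q^2
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print("coefficients:", np.round(coef, 3))
print("r^2 =", round(1 - ss_res / ss_tot, 3))
```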



Journal ArticleDOI
TL;DR: The technique further departs from other pattern-matching methods by readily allowing the creation of fingerprints at superfamily-, family- and subfamily-specific levels, thereby allowing more fine-grained diagnoses.
Abstract: The PRINTS database houses a collection of protein fingerprints, which may be used to assign family and functional attributes to uncharacterised sequences, such as those currently emanating from the various genome-sequencing projects. The April 2002 release includes 1,700 family fingerprints, encoding ~10,500 motifs, covering a range of globular and membrane proteins, modular polypeptides and so on. Fingerprints are groups of conserved motifs that, taken together, provide diagnostic protein family signatures. They derive much of their potency from the biological context afforded by matching motif neighbours; this makes them at once more flexible and powerful than single-motif approaches. The technique further departs from other pattern-matching methods by readily allowing the creation of fingerprints at superfamily-, family- and subfamily-specific levels, thereby allowing more fine-grained diagnoses. Here, we provide an overview of the method of protein fingerprinting and how the results of fingerprint analyses are used to build PRINTS and its relational cousin, PRINTS-S.
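
The defining idea of a fingerprint is that several motifs must match, in the right order, for a sequence to be assigned to a family. The toy sketch below captures only that ordering constraint, using regular expressions instead of the aligned, frequency-weighted motifs PRINTS actually employs; the motifs and sequence are invented.

```python
import re

def fingerprint_hits(sequence, motifs):
    """Toy fingerprint scan: motifs (regular expressions here, not the
    ungapped aligned motifs PRINTS actually uses) must occur in order.

    Returns the matched motif start positions, or None if the ordering
    constraint is broken. Real fingerprint scoring also tolerates partial
    matches and weighs residue frequencies; this sketch does neither.
    """
    pos, hits = 0, []
    for motif in motifs:
        m = re.search(motif, sequence[pos:])
        if not m:
            return None
        hits.append(pos + m.start())
        pos += m.end()
    return hits

# Invented three-motif fingerprint and target sequence
motifs = ["G[AS]SG", "W..P", "K[LIV]{2}R"]
sequence = "MKTGASGQEVLAWTMPNNDAKVIRES"
print(fingerprint_hits(sequence, motifs))   # -> [3, 12, 20]
```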

Journal ArticleDOI
TL;DR: The general field of information extraction is introduced, the status of the applications in molecular biology is outlined, and some ideas about possible standards for evaluation that are needed for the future development of the field are discussed.
Abstract: Information extraction has become a very active field in bioinformatics recently and a number of interesting papers have been published. Most of the efforts have been concentrated on a few specific problems, such as the detection of protein-protein interactions and the analysis of DNA expression arrays, although it is obvious that there are many other interesting areas of potential application (document retrieval, protein functional description, and detection of disease-related genes to name a few). Paradoxically, these exciting developments have not yet crystallised into general agreement on a set of standard evaluation criteria, such as the ones developed in fields such as protein structure prediction, which makes it very difficult to compare performance across these different systems. In this review we introduce the general field of information extraction, we outline the status of the applications in molecular biology, and we then discuss some ideas about possible standards for evaluation that are needed for the future development of the field.
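
Whatever evaluation standard the field settles on, it will almost certainly rest on precision, recall and F-measure against a gold-standard corpus. A minimal sketch, applied here to invented protein-protein interaction pairs:

```python
def evaluate_extraction(predicted, gold):
    """Precision, recall and F1 for an information-extraction task,
    here over sets of extracted protein-protein interaction pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented interactions extracted from abstracts vs. a hand-curated gold set
predicted = {("p53", "MDM2"), ("RAS", "RAF"), ("ACT1", "MYO2")}
gold = {("p53", "MDM2"), ("RAS", "RAF"), ("CDC28", "CLN2")}
p, r, f = evaluate_extraction(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```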

Journal ArticleDOI
TL;DR: Issues related to bioinformatics, namely data analysis, visualisation and archiving, are the main focus of this review of metabolite profiling in functional genomics.
Abstract: Metabolic profiling applied to functional genomics (metabolomics) is in an early stage of development. Here, the technologies used for metabolite profiling are briefly covered, illustrated by a few pioneering studies. Issues related to bioinformatics, namely data analysis, visualisation and archiving, are the main focus of this review. Arguably there is already a need for databases containing metabolite profiles specific for a single organism, and a generic repository containing all metabolite profiling results, regardless of species. Data analyses and visualisations that combine the biological context with chemistry details are suggested as being the most promising.

Journal ArticleDOI
TL;DR: The amino acid sequence motifs that direct proteins to their proper subcellular compartment are surveyed, different methods for localisation prediction are discussed, and some benchmarks for the more commonly used predictors are presented.
Abstract: Predicting the subcellular localisation of proteins is an important part of the elucidation of their functions and interactions. Here, the amino acid sequence motifs that direct proteins to their proper subcellular compartment are surveyed, different methods for localisation prediction are discussed, and some benchmarks for the more commonly used predictors are presented.
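
A few targeting signals can be spotted with very crude sequence rules, for example the C-terminal KDEL/HDEL ER-retention signal or a basic nuclear localisation stretch. The sketch below applies such rules purely for illustration; real predictors rely on trained models over many features, and the patterns and thresholds used here are simplifications.

```python
import re

def quick_localisation_hints(seq):
    """Very rough checks for a few well-known targeting motifs.

    These rules are drastic simplifications of what real predictors do;
    they only illustrate the kind of sequence signal involved.
    """
    hints = []
    if seq.endswith(("KDEL", "HDEL")):                  # ER retention signal
        hints.append("ER (C-terminal K/HDEL)")
    if re.search(r"K[KR].{0,1}[KR]K", seq):             # crude monopartite NLS
        hints.append("nucleus (basic NLS-like stretch)")
    if re.match(r"M[^DE]{5,25}[AGSC]", seq) and seq[1:20].count("L") >= 4:
        hints.append("secretory/mitochondrial? (hydrophobic N-terminus)")
    return hints or ["no obvious signal"]

# Invented sequences: the first contains the SV40 large T antigen NLS,
# the second a signal-peptide-like N-terminus plus a C-terminal KDEL.
print(quick_localisation_hints("MPKKKRKVEDPQSLLDELS"))
print(quick_localisation_hints("MKLLVLLALLGSAHAEEAKDEL"))
```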

Journal ArticleDOI
TL;DR: This review briefly compares the quirks of the underlying languages and the functionality, documentation, utility and relative advantages of the Bio counterparts, particularly from the point of view of the beginning biologist programmer.
Abstract: Bioinformatics research is often difficult to do with commercial software. The Open Source BioPerl, BioPython and BioJava projects provide toolkits with multiple functionality that make it easier to create customised pipelines or analyses. This review briefly compares the quirks of the underlying languages and the functionality, documentation, utility and relative advantages of the Bio counterparts, particularly from the point of view of the beginning biologist programmer.
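
As a flavour of what these toolkits offer, here is a short Biopython example using the Bio.SeqIO interface of current releases (which postdates this review); the filename is a placeholder.

```python
from Bio import SeqIO

# Parse a FASTA file of coding sequences and report basic properties.
# "cds.fasta" is a placeholder filename.
for record in SeqIO.parse("cds.fasta", "fasta"):
    protein = record.seq.translate(to_stop=True)
    print(record.id, len(record.seq), "nt ->", len(protein), "aa")
```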

Journal ArticleDOI
TL;DR: The applications of InterPro span a range of biologically important areas that include automatic annotation of protein sequences and genome analysis; InterPro also provides a means to carry out statistical and comparative analyses of whole genomes.
Abstract: The applications of InterPro span a range of biologically important areas that include automatic annotation of protein sequences and genome analysis. In automatic annotation of protein sequences InterPro has been utilised to provide reliable characterisation of sequences, identifying them as candidates for functional annotation. Rules based on the InterPro characterisation are stored and operated through a database called RuleBase. RuleBase is used as the main tool in the sequence database group at the EBI to apply automatic annotation to unknown sequences. The annotated sequences are stored and distributed in the TrEMBL protein sequence database. InterPro also provides a means to carry out statistical and comparative analyses of whole genomes. In the Proteome Analysis Database, InterPro analyses have been combined with other analyses based on CluSTr, the Gene Ontology (GO) and structural information on the proteins.
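
RuleBase itself is an internal EBI system, but the underlying idea of signature-conditioned annotation rules can be sketched in a few lines. The accessions, rule contents and protein identifiers below are purely illustrative, not taken from RuleBase.

```python
# Hypothetical rules: if a protein matches a given InterPro entry, propose
# standard annotation. Accessions and annotation text are illustrative only.
rules = {
    "IPR000001": {"keyword": "Kringle", "description": "Kringle domain-containing protein"},
    "IPR000719": {"keyword": "Kinase", "description": "Protein kinase family member"},
}

# Hypothetical InterPro matches for two uncharacterised proteins
matches = {"Q9XYZ1": ["IPR000719"], "Q9XYZ2": ["IPR000001", "IPR999999"]}

for protein, hits in matches.items():
    for accession in hits:
        rule = rules.get(accession)
        if rule:
            print(f"{protein}: add keyword '{rule['keyword']}', "
                  f"description '{rule['description']}'")
```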


Journal ArticleDOI
TL;DR: The process of building a new database relevant to some field of study in biomedicine involves transforming, integrating and cleansing multiple data sources, as well as adding new material and annotations.
Abstract: The process of building a new database relevant to some field of study in biomedicine involves transforming, integrating and cleansing multiple data sources, as well as adding new material and annotations. This paper reviews some of the requirements of a general solution to this data integration problem. Several representative technologies and approaches to data integration in biomedicine are surveyed. Then some interesting features that separate the more general data integration technologies from the more specialised ones are highlighted.

Journal ArticleDOI
TL;DR: The basics of the most widely used conceptual modelling notations, the ER (entity-relationship) model and the class diagrams of the UML (unified modelling language), are described and their use through several examples from bioinformatics is illustrated.
Abstract: Current research in the biosciences depends heavily on the effective exploitation of huge amounts of data. These are in disparate formats, remotely dispersed, and based on the different vocabularies of various disciplines. Furthermore, data are often stored or distributed using formats that leave implicit many important features relating to the structure and semantics of the data. Conceptual data modelling involves the development of implementation-independent models that capture and make explicit the principal structural properties of data. Entities such as a biopolymer or a reaction, and their relations, eg catalyses, can be formalised using a conceptual data model. Conceptual models are implementation-independent and can be transformed in systematic ways for implementation using different platforms, eg traditional database management systems. This paper describes the basics of the most widely used conceptual modelling notations, the ER (entity-relationship) model and the class diagrams of the UML (unified modelling language), and illustrates their use through several examples from bioinformatics. In particular, models are presented for protein structures and motifs, and for genomic sequences.
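
An ER diagram or UML class diagram cannot be drawn in plain text here, but the same entities and relationships translate directly into code. The sketch below models a one-to-many relationship between Protein and Domain as Python dataclasses; the attributes are chosen for illustration and are not the models from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Domain:
    """A structural/functional domain occurring within a protein."""
    name: str
    start: int           # 1-based position in the protein sequence
    end: int

@dataclass
class Protein:
    """Entity 'Protein' with a one-to-many relationship to 'Domain'."""
    accession: str
    sequence: str
    domains: List[Domain] = field(default_factory=list)

    def covered_fraction(self) -> float:
        covered = sum(d.end - d.start + 1 for d in self.domains)
        return covered / len(self.sequence) if self.sequence else 0.0

# Invented accession, sequence and domain coordinates
p = Protein("P00001", "M" * 120, [Domain("SH3", 5, 60), Domain("SH2", 65, 115)])
print(f"{p.accession}: {len(p.domains)} domains, "
      f"{p.covered_fraction():.0%} of sequence covered")
```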

Journal ArticleDOI
TL;DR: This paper reviews the Pfam, TIGRFAMs and SMART databases, which use the profile HMMs of the HMMER package to detect members of protein families and thereby support the study of protein evolution and function.
Abstract: Protein family databases are an important resource for protein annotation and understanding protein evolution and function. In recent years hidden Markov models (HMMs) have become one of the key technologies used for detection of members of these families. This paper reviews the Pfam, TIGRFAMs and SMART databases that use the profile-HMMs provided by the HMMER package.



Journal ArticleDOI
TL;DR: An overview of the history and funding of bioinformatics training in the USA is provided, and some of the challenges and key features associated with bioinformatics training programmes at PhD level are summarised.
Abstract: This paper provides an overview of the history and funding of bioinformatics training in the USA, and summarises some of the challenges and key features associated with bioinformatics training programmes at PhD level. The paper includes compilations of current PhD bioinformatics training programmes and sources of funding.

Journal ArticleDOI
TL;DR: Seven popular programs for gene prediction in eukaryotic organisms are described and evaluated on the basis of availability for in-house and on-line use and prediction accuracy.
Abstract: Seven popular programs for gene prediction in eukaryotic organisms are described and evaluated on the basis of availability for in-house and on-line use and prediction accuracy. This report outlines generally applicable approaches to computational gene prediction and known limitations in this field.
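
Prediction accuracy in this field is usually reported as nucleotide-level sensitivity and specificity, where "specificity" conventionally means the fraction of predicted coding nucleotides that are truly coding. A minimal sketch with invented exon coordinates:

```python
def nucleotide_accuracy(predicted_exons, true_exons):
    """Nucleotide-level sensitivity and specificity of a gene prediction.

    Exons are (start, end) pairs, 1-based and inclusive. Following the
    usual convention in gene-prediction benchmarks, 'specificity' is the
    fraction of predicted coding nucleotides that are truly coding.
    """
    def coding_positions(exons):
        return {pos for start, end in exons for pos in range(start, end + 1)}

    predicted = coding_positions(predicted_exons)
    annotated = coding_positions(true_exons)
    true_positives = len(predicted & annotated)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity

# Invented predicted vs. annotated exon coordinates on a test sequence
sn, sp = nucleotide_accuracy([(101, 220), (400, 520)], [(101, 200), (380, 520)])
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```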

Journal ArticleDOI
TL;DR: GeneAtlas is an automated, high-throughput pipeline for the prediction of protein structure and function using sequence similarity detection, homology modelling and fold recognition methods, and can correctly recognise functionally related proteins with sequence identity below the twilight zone.
Abstract: To maximise the assignment of function of the proteins encoded by a genome and to aid the search for novel drug targets, there is an emerging need for sensitive methods of predicting protein function on a genome-wide basis. GeneAtlas is an automated, high-throughput pipeline for the prediction of protein structure and function using sequence similarity detection, homology modelling and fold recognition methods. GeneAtlas is described in detail here. To test GeneAtlas, a 'virtual' genome was used, a subset of PDB structures from the SCOP database, in which the functional relationships are known. GeneAtlas detects additional relationships by building 3D models in comparison with the sequence searching method PSI-BLAST. Functionally related proteins with sequence identity below the twilight zone can be recognised correctly.

Journal ArticleDOI
TL;DR: Because DNA has only four common bases, more information for the alignment can be obtained by using protein sequences, so it often makes sense to translate regions of coding DNA into protein sequence before aligning them.
Abstract: Whether the ultimate aim is a phylogenetic analysis of several orthologues, the identification of a pattern for a particular feature or motif, or the basis for structural modelling, multiple sequence alignments allow the researcher to gather more biological information than a single sequence can offer. Possibly the most popular method for comparing three or more sequences is the clustering algorithm used in applications such as the Clustal (ClustalW and ClustalX) series of programs. It is certainly by no means the only method of alignment, but will be used to illustrate this text. Initial clustering of sequence pairs reduces the computing time required to align multiple sequences and this can be achieved using one of two possible methods. Slow clustering is the more rigorous of the two options, but is noticeably much slower for approximately 20 or more sequences, or fewer, longer regions. It uses the dynamic programming method of Needleman–Wunsch to align each sequence with another according to a weight matrix and gap penalties. The ultimate aim of the computer program is to achieve the highest score possible, within the constraints the program has been placed under. Weight matrices have been developed using homologous sequences, and allocate a score to each residue or nucleotide base indicating the probability of it replacing a different residue or nucleotide base as a possible mutation. In the case of protein sequences, this has been done for all 20 amino acid residues, together with the three ambiguity codes (B = Asp and Asn, Z = Glu and Gln, X = any residue) using several different methods. Nucleotide matrices have also been developed, and in general indicate a positive score for an identical match, and no score, or a negative one for a mismatch. Because of its very nature, and the existence of only four common bases, more information for the alignment can be obtained by using protein sequences, and it often makes sense to translate regions of coding DNA into protein sequence before aligning them. Once a high score has been achieved for each of the sequence pairs in the alignment, they are clustered together in accordance with their relative scores, using the neighbour-joining method to link the closest pairings together, and less similar sequences more remotely. This information is stored as a series of numerical distances arranged by means of nested brackets in a dendrogram file. This file is in no way representative of evolutionary distances, and should not be presented as such. It merely represents the proximity of each sequence within a cluster, and each cluster to another and is used to form the final alignment. The information retained in the dendrogram file may be kept and used to align other multiple sequence sets. Larger sequence volumes may be compared using a faster method, in order to reduce computing time. This is based on the algorithm of Wilbur and Lipman and is quicker but less accurate than the dynamic programming methods of the slow comparison. It involves definition of
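
The dynamic programming step referred to above can be illustrated with a minimal global alignment routine. The sketch below uses a flat match/mismatch score and a linear gap penalty, whereas Clustal uses full weight matrices and separate gap-opening and extension penalties, so it shows the principle rather than the actual implementation.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences by dynamic programming.

    Uses a simple match/mismatch score and a linear gap penalty; this is
    only a minimal illustration of the Needleman-Wunsch principle.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment
    ali_a, ali_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (
                match if a[i-1] == b[j-1] else mismatch):
            ali_a.append(a[i-1])
            ali_b.append(b[j-1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            ali_a.append(a[i-1])
            ali_b.append("-")
            i -= 1
        else:
            ali_a.append("-")
            ali_b.append(b[j-1])
            j -= 1
    return "".join(reversed(ali_a)), "".join(reversed(ali_b)), score[n][m]

# Invented toy sequences
print(needleman_wunsch("GATTACA", "GCATGCA"))
```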

Journal ArticleDOI
Thomas D. Wu
TL;DR: The accumulation of DNA microarray data has now made it possible to analyse expression data using gene expression profiles, and hypothesis tests may be applied to these profiles on a large scale to identify candidate genes of interest.
Abstract: The accumulation of DNA microarray data has now made it possible to use gene expression profiles to analyse expression data. A gene expression profile contains the expression data for a given gene over various samples, and can be contrasted with an expression signature, which contains the expression data for a single sample. Gene expression profiles are most revealing when samples are grouped appropriately, either by standard clinical or pathological categories or by categories discovered through cluster analysis techniques. Expression profiles can exist at various levels of abstraction, yielding information across various tissues or across diseases within a particular tissue. Hypothesis tests may be applied to expression profiles on a large scale to identify candidate genes of interest.
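
Applying hypothesis tests to expression profiles on a large scale can be sketched as a per-gene two-sample t-test followed by a multiple-testing correction. The data below are simulated and the Bonferroni correction is deliberately crude; real analyses would typically use more refined procedures.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Simulated expression matrix: 1,000 genes x 20 samples
# (10 tumour, 10 normal); gene 0 is given a strong expression shift.
expr = rng.normal(size=(1000, 20))
expr[0, :10] += 3.0
groups = np.array([1] * 10 + [0] * 10)

# Per-gene hypothesis test comparing the two sample groups
t, p = ttest_ind(expr[:, groups == 1], expr[:, groups == 0], axis=1)

# Bonferroni correction as a crude guard against multiple testing
candidates = np.where(p * len(p) < 0.05)[0]
print("candidate genes:", candidates)
```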

Journal ArticleDOI
TL;DR: This review includes a short table of milestones in global computing history, lists opportunities global computing offers for bioinformatics, describes the structure of problems well suited for such an approach, and analyses the anatomy of successful projects.
Abstract: Global computing, the collaboration of idle PCs via the Internet in a SETI@home style, emerges as a new way of massive parallel multiprocessing with potentially enormous CPU power. Its relations to the broader, fast-moving field of Grid computing are discussed without attempting a review of the latter. This review (i) includes a short table of milestones in global computing history, (ii) lists opportunities global computing offers for bioinformatics, (iii) describes the structure of problems well suited for such an approach, (iv) analyses the anatomy of successful projects and (v) points to existing software frameworks. Finally, an evaluation of the various costs shows that global computing indeed has merit, if the problem to be solved is already coded appropriately and a suitable global computing framework can be found. Then, either significant amounts of computing power can be recruited from the general public, or - if employed in an enterprise-wide Intranet for security reasons - idle desktop PCs can substitute for an expensive dedicated cluster.

Journal ArticleDOI
TL;DR: This work focuses on the problems of alignment, gene finding and regulatory element discovery, and discusses the issues that have arisen in attempts to solve these problems in the context of whole genome analysis pipelines.
Abstract: The explosion in genomic sequence available in public databases has resulted in an unprecedented opportunity for computational whole genome analyses. A number of promising comparative-based approaches have been developed for gene finding, regulatory element discovery and other purposes, and it is clear that these tools will play a fundamental role in analysing the enormous amount of new data that is currently being generated. The synthesis of computationally intensive comparative computational approaches with the requirement for whole genome analysis represents both an unprecedented challenge and opportunity for computational scientists. We focus on a few of these challenges, using by way of example the problems of alignment, gene finding and regulatory element discovery, and discuss the issues that have arisen in attempts to solve these problems in the context of whole genome analysis pipelines.