
Showing papers by Walter R. Gilks, published in 2005


Journal ArticleDOI
TL;DR: The aim is to combine genomic interaction data, in which domain-domain contacts are not explicitly reported, with the domain-level structure of individual proteins, in order to learn about the structure of interacting protein pairs.
Abstract: Motivation: Several methods have recently been developed to analyse large-scale sets of physical interactions between proteins in terms of physical contacts between the constituent domains, often with a view to predicting new pairwise interactions. Our aim is to combine genomic interaction data, in which domain-domain contacts are not explicitly reported, with the domain-level structure of individual proteins, in order to learn about the structure of interacting protein pairs. Our approach is driven by the need to assess the evidence for physical contacts between domains in a statistically rigorous way. Results: We develop a statistical approach that assigns p-values to pairs of domain superfamilies, measuring the strength of evidence within a set of protein interactions that domains from these superfamilies form contacts. A set of p-values is calculated for SCOP superfamily pairs, based on a pooled data set of interactions from yeast. These p-values can be used to predict which domains come into contact in an interacting protein pair. This predictive scheme is tested against protein complexes in the Protein Quaternary Structure (PQS) database, and is used to predict domain-domain contacts within 705 interacting protein pairs taken from our pooled data set. Contact: thomas.nye@mrc-bsu.cam.ac.uk
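
The abstract does not spell out the test statistic, so the following is only a rough sketch of the general idea: score a superfamily pair by how surprisingly often it co-occurs across interacting protein pairs, here via a hypergeometric over-representation test. The function name and the counting scheme are illustrative assumptions, not the authors' method.

# Hedged sketch: hypergeometric over-representation test for a
# domain-superfamily pair among interacting protein pairs.
# The counting scheme is an assumption for illustration only.
from scipy.stats import hypergeom

def pair_contact_pvalue(n_ab, n_a, n_b, n_total):
    """P-value for observing >= n_ab interacting protein pairs in which
    one protein carries superfamily A and the other carries B, under a
    null of independent occurrence."""
    # Population: all interacting pairs; "successes": pairs containing A;
    # draws: pairs containing B; observed overlap: pairs containing both.
    rv = hypergeom(M=n_total, n=n_a, N=n_b)
    return rv.sf(n_ab - 1)  # P(X >= n_ab)

# Toy example: 20 of 705 pairs contain both A and B
print(pair_contact_pvalue(20, 60, 90, 705))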

89 citations


Journal ArticleDOI
TL;DR: A probabilistic framework is developed for exploring the consequences of the percolation of errors through protein databases; the theory is applied to hierarchically structured protein sequence databases, and conclusions are drawn about database quality at different levels of the hierarchy.
Abstract: Databases of protein sequences have grown rapidly in recent years as a result of genome sequencing projects. Annotating protein sequences with descriptions of their biological function ideally requires careful experimentation, but this work lags far behind. Instead, biological function is often imputed by copying annotations from similar protein sequences. This gives rise to annotation errors and, more seriously, to chains of misannotation. An earlier study, Percolation of annotation errors in a database of protein sequences (2002), developed a probabilistic framework for exploring the consequences of this percolation of errors through protein databases and applied the theory to a simple database model. Here we apply the theory to hierarchically structured protein sequence databases, and draw conclusions about database quality at different levels of the hierarchy.
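
To make the percolation mechanism concrete, here is a minimal toy simulation, assuming a deliberately simple model in which each new database entry copies its annotation from one random earlier entry and the copying step itself goes wrong with probability p_err. This is an illustration of error chains only, not the paper's hierarchical model; all names and defaults are hypothetical.

# Toy simulation of annotation-error percolation (illustrative
# assumption, not the paper's model): each new sequence copies its
# annotation from a random earlier entry, and the copy is additionally
# wrong with probability p_err. Errors therefore propagate in chains.
import random

def simulate_percolation(n_seqs=10_000, p_err=0.05, seed=0):
    rng = random.Random(seed)
    correct = [True]  # founder entry is correctly annotated
    for _ in range(n_seqs - 1):
        source = rng.choice(correct)          # copy from a random earlier entry
        copied_ok = rng.random() >= p_err     # copying may introduce a new error
        correct.append(source and copied_ok)  # a wrong source yields a wrong copy
    return 1 - sum(correct) / len(correct)    # fraction misannotated

print(f"misannotated fraction: {simulate_percolation():.3f}")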

75 citations


Journal ArticleDOI
TL;DR: In this framework, Correspondence Indicators are defined as measures of the relationship between sequence and function, and two Bayesian approaches are formulated to estimate the probability that a sequence of unknown function belongs to a functional class.
Abstract: Background: One of the most evident achievements of bioinformatics is the development of methods that transfer biological knowledge from characterised proteins to uncharacterised sequences. This mode of protein function assignment is mostly based on the detection of sequence similarity and the premise that functional properties are conserved during evolution. Most automatic approaches developed to date rely on the identification of clusters of homologous proteins and the mapping of new proteins onto these clusters, which are expected to share functional characteristics. Results: Here, we invert the logic of this process by mapping sequences directly to a functional classification instead of mapping functions to a sequence clustering. In this mode, the starting point is a database of proteins labelled according to a functional classification scheme, and sequence similarity is then used to define the membership of new proteins in these functional classes. In this framework, we define Correspondence Indicators as measures of the relationship between sequence and function and formulate two Bayesian approaches to estimate the probability that a sequence of unknown function belongs to a functional class. This approach allows the parametrisation of different sequence search strategies and provides a direct measure of annotation error rates. We validate this approach with a database of enzymes labelled by their corresponding four-digit EC numbers and analyse specific cases. Conclusion: The performance of this method is significantly higher than that of the simple strategy of transferring the annotation from the highest-scoring BLAST match, and it is expected to find applications in automated functional annotation pipelines.
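
The Correspondence Indicators themselves are not defined in this listing, so the sketch below only illustrates the general direction: estimate posterior class membership by Bayes' rule from the class labels of similarity hits. The score-weighted voting used as a likelihood proxy, and all function and variable names, are assumptions for illustration.

# Hedged sketch of Bayes-rule functional-class assignment from
# similarity hits. Each hit votes for its class, weighted by its
# similarity score (an illustrative likelihood proxy, not the paper's
# Correspondence Indicators).
from collections import defaultdict

def class_posteriors(hits, priors):
    """hits: list of (ec_class, score) for a query sequence.
    priors: dict mapping ec_class -> prior probability.
    Returns posterior probabilities over the classes seen in hits."""
    weight = defaultdict(float)
    for ec_class, score in hits:
        weight[ec_class] += score  # accumulate score-weighted evidence
    unnorm = {c: priors.get(c, 1e-6) * w for c, w in weight.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

hits = [("1.1.1.1", 250.0), ("1.1.1.1", 180.0), ("2.7.7.7", 60.0)]
print(class_posteriors(hits, priors={"1.1.1.1": 0.01, "2.7.7.7": 0.02}))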

37 citations


Journal ArticleDOI
01 Aug 2005-Blood
TL;DR: An invariant HIV-induced CD antigen signature has been defined that is both robust and independent of clinical outcome; it comprises a unique profile of CD antigen expression levels, some increased and some decreased relative to internal controls.

37 citations


Journal ArticleDOI
TL;DR: Analysis of the human and Takifugu rubripes genomes reveals a novel, sharp and distinct signal of nucleotide frequency bias precisely at the border between CNEs and flanking regions.
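
As a rough illustration of how such a border signal can be detected, the sketch below computes per-position base frequencies across windows centred on CNE boundaries; a sharp change at the centre would be the kind of signal described. The window construction and toy data are assumptions, not the paper's pipeline.

# Hedged sketch: positional nucleotide-frequency profile around CNE
# borders, from equal-length windows centred on each annotated boundary.
from collections import Counter

def positional_frequencies(windows):
    """windows: list of equal-length DNA strings centred on a CNE border."""
    length = len(windows[0])
    profile = []
    for i in range(length):
        counts = Counter(w[i] for w in windows)
        total = sum(counts.values())
        profile.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return profile

windows = ["ACGTACGT", "ACGTTCGA", "ACGAACGT"]  # toy data
for i, freqs in enumerate(positional_frequencies(windows)):
    print(i, freqs)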

36 citations


Journal ArticleDOI
TL;DR: A novel statistical method, the "fluffy-tail test", is presented that distinguishes cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA.
Abstract: This paper addresses the problem of recognising DNA cis-regulatory modules which are located far from genes. Experimental procedures for this are slow and costly, and computational methods are hard because they lack positional information. We present a novel statistical method, the "fluffy-tail test", to recognise regulatory DNA. We exploit one of the basic informational properties of regulatory DNA: an abundance of over-represented transcription factor binding site (TFBS) motifs, although we do not look for specific TFBS motifs per se. Though over-representation of TFBS motifs in regulatory DNA has been intensively exploited by many algorithms, distinguishing regulatory from other genomic DNA remains a difficult problem. We show that, in the data used, our method is able to distinguish cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA. Potential applications of our method include annotation of new genomic sequences and motif discovery.
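
The underlying intuition is that regulatory DNA carries many over-represented words, producing a heavy ("fluffy") tail in the word-count distribution. The sketch below counts k-mers and reports how many exceed a simple over-representation cutoff; the uniform background, thresholds, and planted-motif toy data are assumptions, and the actual fluffy-tail statistic and null model are not reproduced here.

# Hedged sketch: count k-mers and flag those far above a simple
# uniform-background expectation (thresholds are assumptions).
import random
from collections import Counter

def overrepresented_words(seq, k=6, fold=3.0, min_count=10):
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    expected = (len(seq) - k + 1) / (4 ** k)  # uniform-background expectation
    cutoff = max(fold * expected, min_count)
    return [w for w, c in counts.items() if c >= cutoff]

rng = random.Random(1)
background = "".join(rng.choice("ACGT") for _ in range(5000))
planted = background[:2500] + "TGACGTCA" * 30 + background[2500:]  # motif-rich insert
print("background:", len(overrepresented_words(background)))
print("with motif:", len(overrepresented_words(planted)))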

33 citations


Journal ArticleDOI
TL;DR: This work presents an approach to data fusion based on multivariate regression, applies it to data from a previous study on cell-cycle control in Schizosaccharomyces pombe, and provides an implementation in R.
Abstract: Motivation: It is widely acknowledged that microarray data are subject to high noise levels and results are often platform dependent. Therefore, microarray experiments should be replicated several times and in several laboratories before the results can be relied upon. To make the best use of such extensive datasets, methods for microarray data fusion are required. Ideally, the fused data should distil important aspects of the data while suppressing unwanted sources of variation and be amenable to further informal and formal methods of analysis. Also, the variability in the quality of experimentation should be taken into account. Results: We present such an approach to data fusion, based on multivariate regression. We apply our methodology to data from a previous study on cell-cycle control in Schizosaccharomyces pombe. Availability: The algorithm implemented in R is freely available from the authors on request. Contact: wally.gilks@mrc-bsu.cam.ac.uk
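
One plausible regression formulation of such fusion, sketched below under stated assumptions, models each experiment's profile for a gene as a linear recalibration (scale and shift) of a common underlying profile and estimates that profile by alternating least squares. This is an illustration of the general technique, not necessarily the paper's model; all names are hypothetical.

# Hedged sketch: fuse replicate expression profiles by modelling
# y_e = a_e * f + b_e per experiment e and estimating the common
# profile f by alternating least squares (illustrative formulation).
import numpy as np

def fuse(replicates, n_iter=20):
    """replicates: array (n_experiments, n_timepoints) for one gene."""
    E, T = replicates.shape
    f = replicates.mean(axis=0)  # initialise with the mean profile
    for _ in range(n_iter):
        # regress each experiment on the current fused profile
        X = np.column_stack([f, np.ones(T)])
        coef = np.linalg.lstsq(X, replicates.T, rcond=None)[0]  # (2, E): a_e, b_e
        a, b = coef[0], coef[1]
        # closed-form update of f given the per-experiment calibrations
        f = ((replicates - b[:, None]) * a[:, None]).sum(axis=0) / (a ** 2).sum()
        f = (f - f.mean()) / f.std()  # fix location/scale for identifiability
    return f

rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 2 * np.pi, 12))
reps = np.stack([2.0 * truth + 1 + rng.normal(0, 0.3, 12),
                 0.5 * truth - 2 + rng.normal(0, 0.1, 12)])
print(np.round(fuse(reps), 2))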

22 citations


Journal ArticleDOI
TL;DR: Kanti Mardia and Walter Gilks consider the future role of statistics in scientific explanation and prediction, through views expressed by eminent scientists, philosophers and statisticians and through their own experience, particularly in the field of bioinformatics.

7 citations


Journal ArticleDOI
TL;DR: A likelihood-based method is presented for imputing missing data or weighting poor-quality spots; it requires a number of biological or technical replicates and is illustrated using data from an experiment observing gene expression changes after 24 hr of paclitaxel treatment of a human cervical cancer-derived cell line (HeLa).
Abstract: Background: A common feature of microarray experiments is the occurrence of missing gene expression data. These missing values occur for a variety of reasons, in particular because of the filtering of poor quality spots and the removal of undefined values when a logarithmic transformation is applied to negative background-corrected intensities. The efficiency and power of an analysis can be substantially reduced by having an incomplete matrix of gene intensities. Additionally, most statistical methods require a complete intensity matrix. Furthermore, biases may be introduced into analyses through missing information on some genes. Thus methods for appropriately replacing (imputing) missing data and/or weighting poor quality spots are required. Results: We present a likelihood-based method for imputing missing data or weighting poor quality spots that requires a number of biological or technical replicates. This likelihood-based approach assumes that the data for a given spot arising from each channel of a two-dye (two-channel) cDNA microarray comparison experiment independently come from a three-component mixture distribution, the parameters of which are estimated through use of a constrained EM algorithm. Posterior probabilities of belonging to each component of the mixture distributions are calculated and used to decide whether imputation is required. These posterior probabilities may also be used to construct quality weights that can down-weight poor quality spots in any analysis performed afterwards. The approach is illustrated using data obtained from an experiment to observe gene expression changes with 24 hr paclitaxel (Taxol®) treatment on a human cervical cancer-derived cell line (HeLa). Conclusion: As the quality of microarray experiments affects downstream processes, it is important to have a reliable and automatic method of identifying poor quality spots and arrays. We propose a method of identifying poor quality spots, and suggest a method of repairing the arrays by either imputation or assigning quality weights to the spots. This repaired data set would be less biased and can be analysed using any of the appropriate statistical methods found in the microarray literature.
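
The core idea, sketched below, is to fit a mixture model to spot intensities and treat posterior component membership as a spot-quality signal. For simplicity the sketch fits a two-component one-dimensional Gaussian mixture ("reliable" vs "poor" spots) with scikit-learn rather than the paper's constrained three-component EM, so the model, thresholds, and data are all illustrative assumptions.

# Hedged sketch: posterior mixture membership as a spot-quality weight
# (two-component stand-in for the paper's three-component model).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
good = rng.normal(10.0, 0.5, size=300)  # log-intensities of reliable spots
poor = rng.normal(6.0, 2.0, size=40)    # low, noisy intensities of poor spots
x = np.concatenate([good, poor]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(x)
post = gm.predict_proba(x)                     # posterior membership per spot
good_comp = int(np.argmax(gm.means_.ravel()))  # call the higher-mean component "good"
quality_weight = post[:, good_comp]            # down-weights likely-poor spots

# Spots with low weight are candidates for imputation, e.g. by a
# quality-weighted mean of replicate values (illustrative rule).
flagged = np.where(quality_weight < 0.5)[0]
print(f"{flagged.size} spots flagged as poor quality")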

7 citations