
Showing papers by "Rob Knight" published in 2005


Journal ArticleDOI
TL;DR: The results illustrate that UniFrac provides a new way of characterizing microbial communities, using the wealth of environmental rRNA sequences, and allows quantitative insight into the factors that underlie the distribution of lineages among environments.
Abstract: We introduce here a new method for computing differences between microbial communities based on phylogenetic information. This method, UniFrac, measures the phylogenetic distance between sets of taxa in a phylogenetic tree as the fraction of the branch length of the tree that leads to descendants from either one environment or the other, but not both. UniFrac can be used to determine whether communities are significantly different, to compare many communities simultaneously using clustering and ordination techniques, and to measure the relative contributions of different factors, such as chemistry and geography, to similarities between samples. We demonstrate the utility of UniFrac by applying it to published 16S rRNA gene libraries from cultured isolates and environmental clones of bacteria in marine sediment, water, and ice. Our results reveal that (i) cultured isolates from ice, water, and sediment resemble each other and environmental clone sequences from sea ice, but not environmental clone sequences from sediment and water; (ii) the geographical location does not correlate strongly with bacterial community differences in ice and sediment from the Arctic and Antarctic; and (iii) bacterial communities differ between terrestrially impacted seawater (whether polar or temperate) and warm oligotrophic seawater, whereas those in individual seawater samples are not more similar to each other than to those in sediment or ice samples. These results illustrate that UniFrac provides a new way of characterizing microbial communities, using the wealth of environmental rRNA sequences, and allows quantitative insight into the factors that underlie the distribution of lineages among environments.

6,679 citations
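
The UniFrac definition in the abstract above (the fraction of branch length leading to descendants from one environment or the other, but not both) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tree encoding (`parent`, `length`, `leaf_env` dictionaries), the function name `unifrac`, and the toy tree are all assumptions made for illustration, and each leaf is assumed to carry exactly one environment label.

```python
def unifrac(parent, length, leaf_env, env_a, env_b):
    """Unweighted UniFrac-style distance between two environments.

    parent:   child node -> parent node (the root has no entry)
    length:   child node -> length of the branch joining it to its parent
    leaf_env: leaf node  -> environment label (one label per leaf, an assumption)
    """
    # Each branch is named by its child node; record which of the two
    # environments have at least one descendant leaf below that branch.
    envs_below = {node: set() for node in length}
    for leaf, env in leaf_env.items():
        if env not in (env_a, env_b):
            continue  # leaves from other environments are ignored
        node = leaf
        while node in parent:  # walk from the leaf up to the root
            envs_below[node].add(env)
            node = parent[node]

    # Branch length leading to either environment, and to exactly one of them.
    total = sum(length[n] for n, envs in envs_below.items() if envs)
    unique = sum(length[n] for n, envs in envs_below.items() if len(envs) == 1)
    return unique / total if total else 0.0


# Hypothetical toy tree: root -> X, Y; X -> a, b; Y -> c, d (all branches length 1).
toy_parent = {"X": "root", "Y": "root", "a": "X", "b": "X", "c": "Y", "d": "Y"}
toy_length = {node: 1.0 for node in toy_parent}
toy_env = {"a": "ice", "b": "sediment", "c": "ice", "d": "ice"}

distance = unifrac(toy_parent, toy_length, toy_env, "ice", "sediment")
```

In the toy tree only the branch above `X` leads to both environments, so five of the six units of branch length are unique to one environment and the distance is 5/6.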


Journal ArticleDOI
TL;DR: It is shown that a substantial fraction of the genetic code has a stereochemical basis, the triplets having escaped from their original function in amino acid-binding sites to become modern codons and anticodons.
Abstract: There is very significant evidence that cognate codons and/or anticodons are unexpectedly frequent in RNA-binding sites for seven of eight biological amino acids that have been tested. This suggests that a substantial fraction of the genetic code has a stereochemical basis, the triplets having escaped from their original function in amino acid–binding sites to become modern codons and anticodons. We explicitly show that this stereochemical basis is consistent with subsequent optimization of the code to minimize the effect of coding mistakes on protein structure. These data also strengthen the argument for invention of the genetic code in an RNA world and for the RNA world itself.

165 citations


Journal ArticleDOI
TL;DR: It is shown that cartilage defects arise after NCC migration during skeletal differentiation, and that they can be rescued by transplantation of wild-type ectoderm.
Abstract: AP2 transcription factors regulate many aspects of embryonic development. Studies of AP2a (Tfap2a) function in mice and zebrafish have demonstrated a role in patterning mesenchymal cells of neural crest origin that form the craniofacial skeleton, while the mammalian Tfap2b is required in both the facial skeleton and kidney. Here, we show essential functions for zebrafish tfap2a and tfap2b in development of the facial ectoderm, and for signals from this epithelium that induce skeletogenesis in neural crest cells (NCCs). Zebrafish embryos deficient for both tfap2a and tfap2b show defects in epidermal cell survival and lack NCC-derived cartilages. We show that cartilage defects arise after NCC migration during skeletal differentiation, and that they can be rescued by transplantation of wild-type ectoderm. We propose a model in which AP2 proteins play two distinct roles in cranial NCCs: an early cell-autonomous function in cell specification and survival, and a later non-autonomous function regulating ectodermal signals that induce skeletogenesis.

72 citations


Journal ArticleDOI
TL;DR: It is found that actual tRNAs have significantly more matches between the two halves than do random sequences that can form the tRNA structure, and the hypothesis that the modern tRNA cloverleaf arose from a single hairpin duplication prior to the divergence of modern tRNA specificities and the three domains of life is supported.
Abstract: Many studies have suggested that the modern cloverleaf structure of tRNA may have arisen through duplication of a primordial hairpin, but the timing of this duplication event has been unclear. Here we measure the level of sequence identity between the two halves of each of a large sample of tRNAs and compare this level to that of chimeric tRNAs constructed either within or between groups defined by phylogeny and/or specificity. We find that actual tRNAs have significantly more matches between the two halves than do random sequences that can form the tRNA structure, but there is no difference in the average level of matching between the two halves of an individual tRNA and the average level of matching between the two halves of the chimeric tRNAs in any of the sets we constructed. These results support the hypothesis that the modern tRNA cloverleaf arose from a single hairpin duplication prior to the divergence of modern tRNA specificities and the three domains of life.

72 citations


Journal ArticleDOI
01 Nov 2005-RNA
TL;DR: Estimated apparent initial abundances suggest that the simplest isoleucine motif was 20- to 40-fold more frequent in selection with 50- or 70-nucleotide randomized regions than with any other length, and support a significant but lesser role for primer sequences in the outcome of selections.
Abstract: Because the abundance of functional molecules in RNA sequence space has many unexplored aspects, we compared the outcome of 11 independent selections, performed using the same affinity selection protocol and contiguous randomized regions of 16, 22, 26, 50, 70, and 90 nucleotides. All affinity selections targeted the simplest isoleucine aptamer, an asymmetric internal loop. This loop should be abundant in all selections, so that it can be compared across all experiments. In some cases, two primer sets intended to favor selection of different structures have also been compared. The simplest isoleucine aptamer dominates all selections except with the shortest tract, 16 contiguous randomized nucleotides. Here the isoleucine aptamer cannot be accommodated and no other motif can be selected. Our results suggest an optimum length for selection; surprisingly, both the shortest and the longest randomized tracts make it more difficult to recover the motif. Estimated apparent initial abundances suggest that the simplest isoleucine motif was 20- to 40-fold more frequent in selection with 50- or 70-nucleotide randomized regions than with any other length. Considering primer sets, a pre-formed stable stem within fixed flanking sequences had a five- to 10-fold negative effect on apparent motif abundance at all lengths. Differing random tract lengths also determined the probable motif permutation and the most abundant helix lengths. These data support a significant but lesser role for primer sequences in the outcome of selections.

67 citations


Journal ArticleDOI
TL;DR: A new method is developed for calculating the probability of finding a modular motif containing base-paired regions, and a computational grid is used to fold several hundred million random RNA sequences containing the core elements of the isoleucine aptamer and the hammerhead ribozyme to estimate the probability that a sequence containing these structural elements will fold correctly when isolated from background sequences of different compositions.
Abstract: Although functional RNA molecules are known to be biased in overall composition, the effects of background composition on the probability of finding a particular active site by chance have received little attention. The probability of finding a particular motif has important implications both for understanding the distribution of functional RNAs in ancient and modern organisms with varying genome compositions and for tuning SELEX pools to optimize the chance of finding specific functions. Here we develop a new method for calculating the probability of finding a modular motif containing base-paired regions, and use a computational grid to fold several hundred million random RNA sequences containing the core elements of the isoleucine aptamer and the hammerhead ribozyme to estimate the probability that a sequence containing these structural elements will fold correctly when isolated from background sequences of different compositions. We find that the two motifs are most likely to be found in distinct regions of compositional space, and that the regions of greatest abundance are influenced by the probability of finding the conserved bases, finding the flanking helices, and folding, in that order of importance. Additionally, we can refine our estimates of the number of random sequences required for a 50% probability of finding an example of each site in unbiased random pools of length 100 to 4.1 × 10^9 for the isoleucine aptamer and 1.6 × 10^10 for the hammerhead ribozyme. These figures are consistent with the facile recovery of these motifs from SELEX experiments.

57 citations
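
The 50% pool sizes quoted in the abstract above can be related to a per-sequence probability with a short Python sketch. It rests on an assumption that is not stated in the abstract: every sequence in the pool is treated as an independent trial with the same probability `p` of containing a correctly folded motif, so the chance of at least one hit among `n` sequences is 1 - (1 - p)^n. The function names are hypothetical, and this is not the calculation method developed in the paper.

```python
import math


def per_sequence_probability(n_half):
    """Per-sequence hit probability p implied by a 50% pool size n_half.

    Solves 1 - (1 - p)**n_half = 0.5 for p, using expm1 to keep
    precision when p is tiny.
    """
    return -math.expm1(math.log(0.5) / n_half)


def pool_size_for(p, target):
    """Number of independent sequences needed to reach probability `target`
    of at least one hit, given per-sequence probability p."""
    return math.log1p(-target) / math.log1p(-p)


# Pool sizes taken from the abstract (random pools of length 100).
p_isoleucine = per_sequence_probability(4.1e9)
p_hammerhead = per_sequence_probability(1.6e10)

# Under the same independence assumption, pool size for a 95% chance of a hit.
n95_isoleucine = pool_size_for(p_isoleucine, 0.95)
```

For probabilities this small, `p` is very close to ln(2) divided by the 50% pool size, and the 95% pool is larger than the 50% pool by a factor of ln(20)/ln(2), a little over four.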


Journal ArticleDOI
TL;DR: Comparing the code error with the coding triplet concentrations in RNA binding sites for eight amino acids shows that these properties are independent and uncorrelated, and error minimization and triplet associations probably arose independently during the history of the genetic code.
Abstract: The canonical genetic code has been reported both to be error minimizing and to show stereochemical associations between coding triplets and binding sites. In order to test whether these two properties are unexpectedly overlapping, we generated 200,000 randomized genetic codes using each of five randomization schemes, with and without randomization of stop codons. Comparison of the code error (difference in polar requirement for single-nucleotide codon interchanges) with the coding triplet concentrations in RNA binding sites for eight amino acids shows that these properties are independent and uncorrelated. Thus, one is not the result of the other, and error minimization and triplet associations probably arose independently during the history of the genetic code. We explicitly show that prior fixation of a stereochemical core is consistent with an effective later minimization of error.

41 citations


Journal ArticleDOI
TL;DR: These unexpected findings suggest that selection against translation error has not produced codon or amino-acid usages that minimize the effects of errors, and that even messages with very different nucleotide compositions somehow maintain a relatively constant error value.
Abstract: Background: Do species use codons that reduce the impact of errors in translation or replication? The genetic code is arranged in a way that minimizes errors, defined as the sum of the differences in amino-acid properties caused by single-base changes from each codon to each other codon. However, the extent to which organisms optimize the genetic messages written in this code has been far less studied. We tested whether codon and amino-acid usages from 457 bacteria, 264 eukaryotes, and 33 archaea minimize errors compared to random usages, and whether changes in genome G+C content influence these error values. Results: We tested the hypotheses that organisms choose their codon usage to minimize errors, and that the large observed variation in G+C content in coding sequences, but the low variation in G+U or G+A content, is due to differences in the effects of variation along these axes on the error value. Surprisingly, the biological distribution of error values has far lower variance than randomized error values, but error values of actual codon and amino-acid usages are actually greater than would be expected by chance. Conclusion: These unexpected findings suggest that selection against translation error has not produced codon or amino-acid usages that minimize the effects of errors, and that even messages with very different nucleotide compositions somehow maintain a relatively constant error value. They raise the question: why do all known organisms use highly error-minimizing genetic codes, but fail to minimize the errors in the mRNA messages they encode?

28 citations


Journal ArticleDOI
TL;DR: The authors argue that the challenges encountered in proteomics provide a valuable lesson on the complexity of life itself, as live organisms always contradict oversimplified models of biological information flow.
Abstract: In this article, a survey on experimental and computational approaches related to proteomics is presented. Considered broadly, proteomics includes: techniques for identifying proteins in a sample, detecting posttranslational modifications (changes to proteins after translation), predicting the structure and function of proteins from sequence data, and integrating information about protein sequences from different databases. The paper focuses on the ways in which recent biological findings complicate the mapping from genes to RNA to protein. The authors argue that the challenges encountered in proteomics provide a valuable lesson on the complexity of life itself, as live organisms always contradict oversimplified models of biological information flow. In this overview, a snapshot of contemporary issues in proteomics is shown.

17 citations


Journal ArticleDOI
TL;DR: A strong association is demonstrated between the occurrence of AARSs in the complexes and the volume of their substrate amino acids, and the significance of this association is discussed in terms of the structural organization of translation in the living cell.

11 citations



Journal ArticleDOI
TL;DR: In this paper, a new database that relates structural information from proteins in the Protein Data Bank to closely related protein sequences in humans was developed, which can be used to answer many kinds of structural questions (including questions related to posttranslational modifications).
Abstract: In this article, a new database that relates structural information from proteins in the Protein Data Bank to closely related protein sequences in humans was developed. Because the match criteria are extremely stringent, the structures of proteins in other species were used to infer characteristics of the human proteins. As a demonstration of the approach, this database has been applied to the problem of identifying likely trypsin miscleavage sites, a significant problem in proteomics. However, the approach is very general, and can be used to answer many kinds of structural questions (including questions related to posttranslational modifications). The study found that both the surface area and the secondary structure of cleavage sites have highly statistically significant effects on trypsin cleavage. The results of this analysis do not, however, suggest that surface area or secondary structure properties of particular peptides can be used to predict miscleavage sites, at least at a global level. This analysis of cleavage sites demonstrates the general power of homology-based techniques, in which the characteristics of a single protein whose structure has been solved can be used to infer properties of other proteins. We expect that our database of related proteins, structures, and sequences and our ability to query experimentally determined sets of peptides against this database will allow us to answer many other questions related to global protein expression and modification.

Proceedings ArticleDOI
15 May 2005
TL;DR: This paper describes techniques and software that allow the power of computational grids to be applied to large-scale, loosely coupled parallel bioinformatics problems, and demonstrates seamless performance on an ad-hoc grid composed of a wide variety of hardware for a real-life parallel bioinformatics problem.
Abstract: In recent years our society has witnessed an unprecedented growth in computing power available to tackle important problems in science, engineering and medicine. For example, the SHARCNET network links large computing resources in 11 leading academic institutions in South Central Ontario, thus providing access to thousands of compute processors. It is a continuous challenge to develop efficient and scalable algorithms and methods for solving large scientific and engineering problems on such parallel and distributed computers. If the computing power available in such computational grids can be unleashed effectively in a scalable way, large scientific problems can be solved that would otherwise be hard to solve using the machines available in a stand-alone way. This paper describes techniques and software that allow the power of computational grids to be applied to large-scale, loosely coupled parallel bioinformatics problems. Our approach is based on decentralization and implemented in Java, leading to a flexible, portable and scalable software solution for parallel bioinformatics. We discuss advantages and disadvantages of this approach, and demonstrate seamless performance on an ad-hoc grid composed of a wide variety of hardware for a real-life parallel bioinformatics problem. The bioinformatics problem described consists of virtual experiments in RNA folding executed on hundreds of compute processors concurrently, which may establish one of the missing links in the chain of events that led to the origin of life.

Book ChapterDOI
09 Jun 2005
TL;DR: This work performed virtual experiments in RNA folding on computational grids composed of fast supercomputers, in order to estimate the smallest pool of random RNA molecules that would contain enough catalytic motifs for starting a primitive metabolism.
Abstract: Due to ever-increasing data sizes and the high computational complexity of many algorithms, there is a natural drive towards applying parallel and distributed computing to bioinformatics problems. Grid computing techniques can provide flexible, portable and scalable software solutions for parallel bioinformatics. Here we describe the TaskSpaces software framework for grid computing. TaskSpaces is characterized by two major design choices: decentralization, provided by an underlying tuple space concept, and platform independence, provided by implementation in Java. We discuss advantages and disadvantages of this approach, and demonstrate seamless performance on an ad-hoc grid composed of a wide variety of hardware for a real-life parallel bioinformatics problem. Specifically, we performed virtual experiments in RNA folding on computational grids composed of fast supercomputers, in order to estimate the smallest pool of random RNA molecules that would contain enough catalytic motifs for starting a primitive metabolism. These experiments may establish one of the missing links in the chain of events that led to the origin of life. — Note: To appear as a Chapter in the textbook Parallel Computing in Bioinformatics and Computational Biology, A. Zomaya, editor, John Wiley and Sons, 2005.
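
The tuple space concept mentioned in the abstract above can be illustrated with a minimal bag-of-tasks sketch. TaskSpaces itself is described as a Java framework; the Python code below is not its API. The class `TupleSpace`, its `out` and `take` methods, the tuple layouts, and the G+C counting job (a stand-in for an RNA folding task) are all hypothetical names chosen for illustration.

```python
import threading


class TupleSpace:
    """A tiny in-process tuple space: a shared bag of tuples that
    producers write to and consumers take from by pattern."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        """Write a tuple into the space and wake any waiting takers."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def take(self, pattern):
        """Remove and return the first tuple matching `pattern`, blocking
        until one exists. None in the pattern matches any value."""
        with self._cond:
            while True:
                for i, tup in enumerate(self._tuples):
                    if len(tup) == len(pattern) and all(
                        p is None or p == t for p, t in zip(pattern, tup)
                    ):
                        return self._tuples.pop(i)
                self._cond.wait()


def worker(space):
    """Repeatedly take a task tuple, process it, and write a result tuple.
    A task whose payload is None tells the worker to stop."""
    while True:
        _, task_id, seq = space.take(("task", None, None))
        if seq is None:
            return
        gc_count = sum(base in "GC" for base in seq)  # placeholder job
        space.out(("result", task_id, gc_count))


def run_demo(sequences, n_workers=3):
    """Distribute one task per sequence and collect results by task id."""
    space = TupleSpace()
    threads = [threading.Thread(target=worker, args=(space,)) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for i, seq in enumerate(sequences):
        space.out(("task", i, seq))
    results = {}
    for i in range(len(sequences)):
        _, _, value = space.take(("result", i, None))
        results[i] = value
    for _ in threads:
        space.out(("task", "stop", None))  # one stop tuple per worker
    for t in threads:
        t.join()
    return results


demo_results = run_demo(["GGCA", "AUAU", "CCCC"])
```

Workers never address each other or a central scheduler directly; they only read and write tuples, which is the sense in which this style of coordination is decentralized. A grid deployment would replace the in-process list with a networked space, which this sketch does not attempt.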