
Showing papers by "Richard Durbin published in 1994"


Journal Article · DOI
03 Mar 1994 · Nature
TL;DR: The nucleotide sequence of 2,181,032 contiguous base pairs in the central gene cluster of chromosome III has been completed; comparison with the public sequence databases reveals similarities to previously known genes for about one gene in three.
Abstract: As part of our effort to sequence the 100-megabase (Mb) genome of the nematode Caenorhabditis elegans, we have completed the nucleotide sequence of a contiguous 2,181,032 base pairs in the central gene cluster of chromosome III. Analysis of the finished sequence has indicated an average density of about one gene per five kilobases; comparison with the public sequence databases reveals similarities to previously known genes for about one gene in three. In addition, the genomic sequence contains several intriguing features, including putative gene duplications and a variety of other repeats with potential evolutionary implications.
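
A quick sanity check on the density quoted above (figures derived here from the abstract's own numbers, not quoted from the paper, whose gene counts may differ):

$$\frac{2{,}181{,}032\ \text{bp}}{\approx 5{,}000\ \text{bp per gene}} \approx 436\ \text{predicted genes}, \qquad \frac{436}{3} \approx 145\ \text{genes with database similarities}.$$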

1,612 citations


Journal Article · DOI
TL;DR: This work describes a general approach to several RNA sequence analysis problems using probabilistic models, called 'covariance models', that flexibly describe the secondary structure and primary sequence consensus of an RNA sequence family.
Abstract: We describe a general approach to several RNA sequence analysis problems using probabilistic models that flexibly describe the secondary structure and primary sequence consensus of an RNA sequence family. We call these models 'covariance models'. A covariance model of tRNA sequences is an extremely sensitive and discriminative tool for searching for additional tRNAs and tRNA-related sequences in sequence databases. A model can be built automatically from an existing sequence alignment. We also describe an algorithm for learning a model and hence a consensus secondary structure from initially unaligned example sequences and no prior structural information. Models trained on unaligned tRNA examples correctly predict tRNA secondary structure and produce high-quality multiple alignments. The approach may be applied to any family of small RNA sequences.
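
The covariation signal these models exploit can be illustrated with a small sketch (not code from the paper; the actual covariance models use stochastic context-free grammars and dynamic programming). Mutual information between two alignment columns flags compensatory base changes of the kind a covariance model captures as a base pair; the toy alignment and function names below are invented for illustration:

import math
from collections import Counter

def mutual_information(col_i, col_j):
    # Mutual information (in bits) between two alignment columns.
    # High MI means the columns co-vary, as paired bases do when
    # compensatory mutations preserve secondary structure.
    n = len(col_i)
    p_i = Counter(col_i)
    p_j = Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    return sum((c / n) * math.log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
               for (a, b), c in p_ij.items())

# Toy alignment: columns 0 and 3 co-vary like a G-C / A-U base pair.
alignment = ["GCAC", "GAAC", "ACGU", "AUGU", "GGAC", "AAGU"]
columns = list(zip(*alignment))
print(mutual_information(columns[0], columns[3]))  # high MI: candidate pair
print(mutual_information(columns[1], columns[2]))  # lower MI: likely unpaired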

853 citations


Journal Article · DOI
TL;DR: These observations suggest that the dominant HD mutation either confers a new property on the mRNA or alters an interaction at the protein level, while the differing abundance of the gene's two transcript forms across tissues points to interacting factors determining the specificity of cell loss.
Abstract: Huntington's disease, a neurodegenerative disorder characterized by loss of striatal neurons, is caused by an expanded, unstable trinucleotide repeat in a novel 4p16.3 gene. To lay the foundation for exploring the pathogenic mechanism in HD, we have determined the structure of the disease gene and examined its expression. The HD locus spans 180 kb and consists of 67 exons ranging in size from 48 bp to 341 bp with an average of 138 bp. Scanning of the HD transcript failed to reveal any additional sequence alterations characteristic of HD chromosomes. A codon loss polymorphism in linkage disequilibrium with the disorder revealed that both normal and HD alleles are represented in the mRNA population in HD heterozygotes, indicating that the defect does not eliminate transcription. The gene is ubiquitously expressed as two alternatively polyadenylated forms displaying different relative abundance in various fetal and adult tissues, suggesting the operation of interacting factors in determining specificity of cell loss. The HD gene was disrupted in a female carrying a balanced translocation with a breakpoint between exons 40 and 41. The absence of any abnormal phenotype in this individual argues against simple inactivation of the gene as the mechanism by which the expanded trinucleotide repeat causes HD. Taken together, these observations suggest that the dominant HD mutation either confers a new property on the mRNA or, more likely, alters an interaction at the protein level.

277 citations


Journal Article · DOI
TL;DR: Two programs, MSPcrunch and Blixem, assist in processing results from the database search programs in the BLAST suite: MSPcrunch removes biased-composition and redundant matches while keeping weak matches that are consistent with a larger gapped alignment, and Blixem displays the surviving matches for inspection.
Abstract: When routinely analysing very long stretches of DNA sequence produced by genome sequencing projects, detailed analysis of database search results becomes exceedingly time-consuming. To reduce the tedious browsing of large quantities of protein similarities, two programs, MSPcrunch and Blixem, were developed to assist in processing the results from the database search programs in the BLAST suite. MSPcrunch removes biased-composition and redundant matches while keeping weak matches that are consistent with a larger gapped alignment. This makes BLAST searching more sensitive in practice and reduces the risk of overlooking distant similarities. Blixem is a multiple sequence alignment viewer for the X Window System that makes it significantly easier to scan and evaluate the matches ratified by MSPcrunch. In Blixem, matches to the translated DNA query sequence are aligned simultaneously in three frames, and the distribution of matches over the whole DNA query is displayed. Examples of usage are drawn from 36 C. elegans cosmid clones totalling 1.2 megabases, to which these tools were applied.
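
The core MSPcrunch heuristic — keep a weak ungapped match if it is consistent with a larger gapped alignment — might be sketched as follows (a minimal illustration; the thresholds, class names, and gap bound are invented assumptions, not MSPcrunch's actual rules or code):

from typing import NamedTuple, List

class MSP(NamedTuple):
    # One ungapped BLAST match (maximal segment pair).
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    score: float

def consistent(a: MSP, b: MSP, max_gap: int = 50) -> bool:
    # Two MSPs are consistent with one gapped alignment if the second
    # starts after the first on both query and subject, separated by
    # a bounded gap (the bound here is an illustrative assumption).
    return (0 <= b.q_start - a.q_end <= max_gap and
            0 <= b.s_start - a.s_end <= max_gap)

def crunch(msps: List[MSP], strong: float = 100, weak: float = 40) -> List[MSP]:
    # Keep strong matches outright; keep a weak match only if it chains
    # with some other MSP into a plausible larger gapped alignment.
    kept = [m for m in msps if m.score >= strong]
    kept += [m for m in msps
             if weak <= m.score < strong
             and any(consistent(m, k) or consistent(k, m)
                     for k in msps if k is not m)]
    return kept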

123 citations


Book Chapter · DOI
TL;DR: Although the amount of genome data being collected is growing exponentially, the capacity of computer storage systems is growing exponentially too, with an even shorter doubling time, so the issue of raw storage capacity is becoming progressively easier.
Abstract: Systematic genome mapping and sequencing projects are generating resources that will permanently change the practice of molecular biology. To maximise their effect, we have to make the information available to the scientific community in as useful a form as possible. It has been said that the sheer quantity of genomic information we are just now beginning to gather will cause problems for any database system that must store it. That is not in itself strictly true: the current total of genome mapping and sequence data, for all organisms combined, would sit comfortably on a one-gigabyte disk, which is small for a workstation and even conceivable for a PC. Furthermore, although the amount of genome data being collected is growing exponentially, so is the capacity of computer storage systems, with an even shorter doubling time, so the issue of raw storage capacity is becoming progressively easier.
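
The doubling-time argument can be made precise with a little algebra (the symbols are ours, not the chapter's): if accumulated genome data grows as $D(t) = D_0\,2^{t/\tau_D}$ and storage capacity as $C(t) = C_0\,2^{t/\tau_C}$, with the storage doubling time $\tau_C$ shorter than the data doubling time $\tau_D$, then

$$\frac{D(t)}{C(t)} = \frac{D_0}{C_0}\,2^{\,t\,(1/\tau_D - 1/\tau_C)} \longrightarrow 0 \quad \text{as } t \to \infty,$$

so the fraction of available storage consumed by genome data shrinks over time, exactly as claimed.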

58 citations


Proceedings Article
31 Dec 1994
TL;DR: A rule-based expert system, combined with an algorithm for avoiding non-informative biased residue composition matches, was developed to automate most of the decisions a trained sequence analyst would make.
Abstract: When confronted with the task of finding homology to large numbers of sequences, database searching tools such as BLAST and FASTA generate prohibitively large amounts of information. An automatic way of making most of the decisions a trained sequence analyst would make was developed, by means of a rule-based expert system combined with an algorithm that avoids non-informative biased residue composition matches. The results found relevant by the system are presented so concisely and clearly that the homology can be assessed with minimum effort. The expert system, HSPcrunch, was implemented to process the output of the programs in the BLAST suite. HSPcrunch embodies rules for detecting distant similarities when pairs of weak matches are consistent with a larger gapped alignment, i.e. when BLAST has broken a longer gapped alignment into smaller ungapped ones. In this way, more distant similarities can be detected with few or no spurious matches as a side-effect. The rules for how small the gaps must be to be considered significant were derived empirically. Currently, a set of rules is used that operates on two scoring levels: one for very weak matches with very small gaps, and one for medium-weak matches with slightly larger gaps. This set of rules proved robust in most cases and gives high-fidelity separation between real homologies and spurious matches. One of the most important rules for reducing the amount of output is to limit the number of overlapping matches to the same region of the query sequence. (ABSTRACT TRUNCATED AT 250 WORDS)
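
Two of the rules described above — the two scoring tiers with gap-size limits, and the cap on overlapping matches per query region — might look roughly like this (all thresholds and names are invented for illustration; the paper's empirically derived values are not reproduced here):

from typing import NamedTuple, List

class Match(NamedTuple):
    q_start: int
    q_end: int
    score: float

def tier_accept(score: float, gap: int) -> bool:
    # Two-tier rule: very weak matches are accepted only with very
    # small gaps; medium-weak matches may have slightly larger gaps.
    # All thresholds below are illustrative assumptions.
    if score >= 60:            # medium-weak tier
        return gap <= 30
    if score >= 40:            # very weak tier
        return gap <= 10
    return False

def limit_overlaps(matches: List[Match], cap: int = 10) -> List[Match]:
    # Keep at most `cap` matches covering any one query region,
    # preferring higher-scoring ones (the cap value is an assumption).
    kept: List[Match] = []
    for m in sorted(matches, key=lambda x: -x.score):
        n_overlapping = sum(1 for k in kept
                            if k.q_start <= m.q_end and m.q_start <= k.q_end)
        if n_overlapping < cap:
            kept.append(m)
    return kept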

11 citations