Topic

Sim4

About: Sim4 is a research topic. Over its lifetime, 18 publications have been published within this topic, receiving 82519 citations.

Papers
Journal Article
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.
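To make the maximal segment pair (MSP) score concrete, here is a minimal sketch that computes it exhaustively for two short sequences. It is not BLAST itself, which approximates this quantity heuristically; the function name `msp_score` and the +5/-4 match/mismatch scores are assumptions chosen for illustration.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Return (score, (start_a, start_b, length)) of the best ungapped segment pair.

    Exhaustive O(len(a) * len(b)) scan, for illustration only. The scoring
    values are assumed defaults, not taken from the text above.
    """
    best = 0
    best_span = (0, 0, 0)
    # prev[j] = best score of an ungapped segment pair ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    prev_len = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        cur_len = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            extended = prev[j - 1] + step
            if extended > 0:
                # Extend the segment along the diagonal (no gaps allowed).
                cur[j] = extended
                cur_len[j] = prev_len[j - 1] + 1
            # Otherwise the segment is dropped and the score resets to 0.
            if cur[j] > best:
                best = cur[j]
                n = cur_len[j]
                best_span = (i - n, j - n, n)
        prev, prev_len = cur, cur_len
    return best, best_span


score, (sa, sb, n) = msp_score("ACGTACGT", "TTACGTAA")
print(score, "ACGTACGT"[sa:sa + n], "TTACGTAA"[sb:sb + n])
```

Because segments are ungapped, each diagonal of the comparison matrix can be scanned independently, which is why a running score that resets at zero is enough.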

88,255 citations

Journal Article
TL;DR: A freely available computer program solves the problem of efficiently aligning a transcribed and spliced DNA sequence with a genomic sequence containing that gene, allowing for introns in the genomic sequence and a relatively small number of sequencing errors.
Abstract: We address the problem of efficiently aligning a transcribed and spliced DNA sequence with a genomic sequence containing that gene, allowing for introns in the genomic sequence and a relatively small number of sequencing errors. A freely available computer program, described herein, solves the problem for a 100-kb genomic sequence in a few seconds on a workstation.

764 citations

Journal Article
TL;DR: The EST_GENOME program aligns spliced DNA to unspliced genomic DNA using a modification of the Smith and Waterman (1981) algorithm; it allows large introns, recognizes splice sites, and uses limited memory.
Abstract: This note describes the program EST_GENOME for aligning spliced DNA to unspliced genomic DNA. It is written in ANSI C and has been tested under Digital OSF3.2. The source code and documentation are available from ftp:// www.sanger.ac.uky ftp/pub/ badger/est_genome.2.tar.Z. The prediction of genes in uncharacterized genomic DNA sequence is currently one of the main problems facing sequence annotators. Methods based on de novo prediction, e.g. searching for motifs like the splice-site consensus, or on statistical properties such as biased codon usage, etc. (Solovyev et al., 1994; Hebsgaard et al., 1996) have been only partially successful, and investigators have often found that the surest way of predicting a gene is by alignment with a homologous protein sequence (Birney et al., 1996; Gelfand et al., 1996; Huang and Zhang, 1996), or a spliced gene product [an expressed sequence tag (EST), mRNA or cDNA], particularly now that a large number of ESTs are available (Hillier et al., 1996). Standard alignment tools are not ideal for finding the correct alignment of a spliced product to genomic DNA, because of the large introns which can occur in the genomic sequence and because the programs ignore the conserved sequences found at donor/acceptor splice sites (intron/exon boundaries). In addition, very large genomic DNA sequences can be hard to align using quadratic-space dynamic programming because they require too much memory. The program EST_GENOME addresses this problem. It allows large introns, can recognize splice sites and uses limited memory. This combination of features makes a powerful and useful tool. EST_GENOME is used routinely at the Sanger Centre to help annotate human genomic sequence. As it is slow compared with search methods like BLAST (Altschul et al., 1990), we first screen genomic DNA against dbEST using BLASTN. Any matching ESTs are realigned using EST_GENOME. The algorithm uses a modification of Smith and Waterman (1981).
The penalty structure used to score an alignment is as follows (defaults are in parentheses). Aligned bases score +match (1) or cost -mismatch (1) as appropriate. An indel in
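As a rough illustration of the scoring described in this abstract, the sketch below computes a plain Smith and Waterman local alignment score with the stated defaults of match = 1 and mismatch = 1. It is not EST_GENOME: intron and splice-site handling are omitted, and the linear indel penalty `gap=2` is a placeholder assumption because the abstract is cut off before giving that value. The function name is likewise hypothetical.

```python
def smith_waterman_score(query, target, match=1, mismatch=1, gap=2):
    """Best local alignment score between query and target.

    match and mismatch follow the defaults quoted in the abstract;
    gap is an assumed value. Only two rows of the matrix are kept, so
    memory is linear in len(target), but no traceback is available.
    """
    best = 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(query) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            same = query[i - 1] == target[j - 1]
            diagonal = prev[j - 1] + (match if same else -mismatch)
            cur[j] = max(
                0,                # start a new local alignment
                diagonal,         # align the two bases
                prev[j] - gap,    # indel: base in query only
                cur[j - 1] - gap, # indel: base in target only
            )
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))
```

A spliced aligner would add at least one more move to the recurrence, a long jump in the genomic sequence charged as an intron rather than as many indels; that extension is left out here.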

281 citations

Journal Article
TL;DR: This second release of the DGC (DGCr2) contains 5061 additional clones, extending the collection to 10,910 cDNAs representing >70% of the predicted genes in Drosophila.
Abstract: The identification of all expressed genes and the structure(s) of their transcripts are prerequisites for many structural and functional genomic studies. Gene-finding programs are valuable tools for identifying gene structure, but they are error-prone and suffer from the inability to predict untranslated regions (UTRs) (Ashburner 2000; Reese et al. 2000). Direct analysis of gene transcripts is the only proven way to establish gene structures with confidence. Generating a collection of expressed sequence tags (ESTs) from high quality cDNA libraries is a widely used approach for acquiring this information (Adams et al. 1991). The sequences of ESTs and full-length nonredundant cDNA collections provide ideal tools for genome annotation and for the further training of gene prediction algorithms. Our first D. melanogaster EST project yielded putative full-length clones corresponding to >5000 different genes (Rubin et al. 2000). This was accomplished by generating 79,636 5′ ESTs from libraries, derived from four different tissues and the Schneider-2 cultured cell line, that contained a high proportion of full-length clones. These 5′ ESTs were clustered by inter se comparison, and the clone that extended the farthest 5′ in each cluster was selected for further analysis. From these clones, 3′ ESTs were then generated, and any clone not containing a polyA tail or that was redundant with another selected clone was eliminated. This collection, the Drosophila Gene Collection Release 1 (DGCr1), comprises full-length clones from ∼40% of the 13,474 genes predicted in D. melanogaster. To obtain cDNA clones for the remaining genes, we generated 5′ ESTs from another 157,835 clones. Because our goal is a collection of full-length cDNA clones, we require that the libraries from which the ESTs are generated have a high percentage of full-length clones.
Improvements in the methodology for cDNA library construction, such as the use of the reverse transcriptase-stabilizing additive trehalose and 5′ cap-trapping methods, have greatly increased the efficiency of generating full-length clones (Carninci et al. 1998; Sugahara et al. 2001). Normalization of cDNA libraries by decreasing the prevalence of clones representing abundant transcripts before sequencing can be used to increase gene discovery rates (Bonaldo et al. 1996; Carninci et al. 2000; Clark et al. 2001). Cap-trapped and normalized embryonic (RE) and head (RH) libraries were constructed and used to generate >115,000 new ESTs. Another requirement for generating a well-represented cDNA collection is to sample from many different tissue types and developmental stages. In addition to sequencing more clones from the S2 and ovary libraries, both of which appear to be good candidates for gene discovery and were not heavily targeted in creating the DGCr1, we also generated 23,215 ESTs from a non-normalized library derived from the adult testis (the AT library). Previous studies (Andrews et al. 2000) indicated that Drosophila testes are a rich source of novel ESTs. All ESTs in our collection were aligned to Release 2 of the D. melanogaster genomic sequence (Adams et al. 2000) using the cDNA alignment tool Sim4 (Florea et al. 1998). A stringent clustering algorithm using the coordinates from the Sim4 alignments identified 5061 additional putative full-length clones. The DGC now consists of 10,910 cDNA clones and is estimated to contain cDNAs for >70% of the predicted genes in the Release 2 annotation of the D. melanogaster genome.

211 citations

Journal Article
Hongyu Zhang
TL;DR: A two-step method, a BLAST step followed by a Longest Increasing Subsequence (LIS) step, aligns thousands of cDNA and protein sequences to the human genome map.
Abstract: Motivation: The popular BLAST algorithm is based on a local similarity search strategy, so its high-scoring segment pairs (HSPs) do not have global alignment information. When scientists use BLAST to search for a target protein or DNA sequence in a huge database like the human genome map, the existence of repeated fragments, homologues or pseudogenes in the genome often makes the BLAST result filled with redundant HSPs. Therefore, we need a computational strategy to alleviate this problem. Results: In the gene discovery group of Celera Genomics, I developed a two-step method, i.e. a BLAST step plus an LIS step, to align thousands of cDNA and protein sequences into the human genome map. The LIS step is based on a mature computational algorithm, Longest Increasing Subsequence (LIS) algorithm. The idea is to use the LIS algorithm to find the longest series of consecutive HSPs in the BLAST output. Such a BLAST+LIS strategy can be used as an independent alignment tool or as a complementary tool for other alignment programs like Sim4 and GenWise. It can also work as a general purpose BLAST result processor in all sorts of BLAST searches. Two examples from Celera were shown in this paper.
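The LIS step described in this abstract can be sketched as follows. This is an illustrative reading, not the author's implementation: each HSP is reduced to a hypothetical `(query_start, subject_start)` pair, and the function name `longest_increasing_hsp_chain` is an assumption. HSPs are ordered along the query, and the longest subsequence whose subject coordinates strictly increase is kept as the collinear chain.

```python
from bisect import bisect_left


def longest_increasing_hsp_chain(hsps):
    """Return the longest chain of HSPs increasing in both coordinates.

    hsps: iterable of (query_start, subject_start) tuples (assumed format).
    Runs in O(n log n) using the patience-sorting form of LIS.
    """
    # Sort by query start; break ties by descending subject start so that
    # two HSPs with the same query start can never both enter the chain.
    ordered = sorted(hsps, key=lambda h: (h[0], -h[1]))
    tails = []      # tails[k]: index of the last HSP of the best chain of length k+1
    tail_keys = []  # subject_start of each entry in tails, kept sorted
    parent = [None] * len(ordered)
    for idx, (_, subject_start) in enumerate(ordered):
        k = bisect_left(tail_keys, subject_start)
        parent[idx] = tails[k - 1] if k > 0 else None
        if k == len(tails):
            tails.append(idx)
            tail_keys.append(subject_start)
        else:
            tails[k] = idx
            tail_keys[k] = subject_start
    # Walk the back-pointers from the end of the longest chain.
    chain = []
    idx = tails[-1] if tails else None
    while idx is not None:
        chain.append(ordered[idx])
        idx = parent[idx]
    return chain[::-1]


hits = [(100, 5000), (10, 1000), (50, 9000), (60, 2000), (200, 7000), (150, 800)]
print(longest_increasing_hsp_chain(hits))
```

HSPs left out of the chain, such as the hits at subject positions 9000 and 800 in this example, are the kind of redundant matches from repeats or pseudogenes that the abstract aims to filter.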

41 citations

Network Information
Related Topics (5)
Genome
74.2K papers, 3.8M citations
79% related
Intron
23.8K papers, 1.3M citations
76% related
Gene
211.7K papers, 10.3M citations
76% related
Sequence analysis
24.1K papers, 1.3M citations
75% related
Exon
38.3K papers, 1.7M citations
74% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2009  1
2007  1
2005  4
2004  1
2003  4
2002  4