
Showing papers on "GenBank published in 2002"


01 Jan 2002
TL;DR: The following sequences have been submitted to the Nomenclature Committee since the May 2002 nomenclature update and, following agreed policy, have been assigned official allele designations.
Abstract: The following sequences have been submitted to the Nomenclature Committee since the May 2002 nomenclature update and, following agreed policy, have been assigned official allele designations. Full details of all sequences will be published in a forthcoming report. Below are listed the newly assigned sequences, confirmations of previously reported sequences and some sequences which are corrections of those originally reported. The accession number of each sequence is given and these can be used to retrieve the sequence files from either the EMBL, GenBank or DDBJ data libraries. Although accession numbers have been assigned by the data libraries and most sequences are already available, there is still the possibility that an author may not yet have allowed the sequence to be released; in such a case you will have to contact the submitting author directly.

912 citations


Journal Article
TL;DR: The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals.
Abstract: The National Center for Biotechnology Information (NCBI) is a large bioinformatics system. It provides data analysis and retrieval resources that operate on the data in GenBank and on a variety of other biological data made available through NCBI's Web site. NCBI data retrieval resources include Entrez, PubMed, LocusLink and the Taxonomy Browser. Data analysis resources include BLAST, Electronic PCR, OrfFinder, etc. It is a very important biological resource for research.

578 citations


Journal ArticleDOI
TL;DR: TTGE and random cloning indicated that certain phylogenetic subgroups were common to all birds analyzed, but sequence data from random cloning also provided evidence for qualitative and quantitative differences among the cecal microbiota of individual birds reared under very similar conditions.
Abstract: The microbiota of the intestinal tract of chickens plays an important role in inhibiting the establishment of intestinal pathogens. Earlier culturing and microscopic examinations indicated that only a fraction of the bacteria in the cecum of chickens could be grown in the laboratory. Therefore, a survey of cecal bacteria was done by retrieval of 16S rRNA gene sequences from DNA isolated from the cecal content and the cecal mucosa. The ribosomal gene sequences were amplified with universal primers and cloned or subjected to temporal temperature gradient gel electrophoresis (TTGE). Partial 16S rRNA gene sequences were determined from the clones and from the major bands in TTGE gels. A total of 1,656 partial 16S rRNA gene sequences were obtained and compared to sequences in the GenBank. The comparison indicated that 243 different sequences were present in the samples. Overall, sequences representing 50 phylogenetic groups or subgroups of bacteria were found, but approximately 89% of the sequences represented just four phylogenetic groups (Clostridium leptum, Sporomusa sp., Clostridium coccoides, and enterics). Sequences of members of the Bacteroides group, the Bifidobacterium infantis subgroup, and of Pseudomonas sp. each accounted for less than 2% of the total. Sequences related to those from the Escherichia sp. subgroup and from Lactobacillus, Pseudomonas, and Bifidobacterium spp. were generally between 98 and 100% identical to sequences already deposited in the GenBank. Sequences most closely related to those of the other bacteria were generally 97% or less identical to those in the databases and therefore might be from currently unknown species. TTGE and random cloning indicated that certain phylogenetic subgroups were common to all birds analyzed, but sequence data from random cloning also provided evidence for qualitative and quantitative differences among the cecal microbiota of individual birds reared under very similar conditions.

454 citations


Journal ArticleDOI
TL;DR: It is indicated that the erythrovirus group is more diverse than thought previously and can be divided into three well-individualized genotypes, with B19 viruses corresponding to genotype 1 and V9-related viruses being distributed into genotypes 2 and 3.
Abstract: B19 virus is a human virus belonging to the genus Erythrovirus. The genetic diversity among B19 virus isolates has been reported to be very low, with less than 2% nucleotide divergence in the whole genome sequence. We have previously reported the isolation of a human erythrovirus isolate, termed V9, whose sequence was markedly distinct (>11% nucleotide divergence) from that of B19 virus. To date, the V9 isolate remains the unique representative of a new variant in the genus Erythrovirus, and its taxonomic position is unclear. We report here the isolation of 11 V9-related viruses. A prospective study conducted in France between 1999 and 2001 indicates that V9-related viruses actually circulate at a significant frequency (11.4%) along with B19 viruses. Analysis of the nearly full-length genome sequence of one V9-related isolate (D91.1) indicates that the D91.1 sequence clusters together with but is notably distant from the V9 sequence (5.3% divergence) and is distantly related to B19 virus sequences (13.8 to 14.2% divergence). Additional phylogenetic analysis of partial sequences from the V9-related isolates combined with erythrovirus sequences available in GenBank indicates that the erythrovirus group is more diverse than thought previously and can be divided into three well-individualized genotypes, with B19 viruses corresponding to genotype 1 and V9-related viruses being distributed into genotypes 2 and 3.

276 citations


Journal ArticleDOI
TL;DR: The often-quoted figure that humans share 98.5% of their DNA sequence with chimpanzee is probably in error; a better estimate would be that 95% of the base pairs are exactly shared between chimpanzee and human DNA.
Abstract: Five chimpanzee bacterial artificial chromosome (BAC) sequences (described in GenBank) have been compared with the best matching regions of the human genome sequence to assay the amount and kind of DNA divergence. The conclusion is the old saw that we share 98.5% of our DNA sequence with chimpanzee is probably in error. For this sample, a better estimate would be that 95% of the base pairs are exactly shared between chimpanzee and human DNA. In this sample of 779 kb, the divergence due to base substitution is 1.4%, and there is an additional 3.4% difference due to the presence of indels. The gaps in alignment are present in about equal amounts in the chimp and human sequences. They occur equally in repeated and nonrepeated sequences, as detected by REPEATMASKER (http://ftp.genome.washington.edu/RM/RepeatMasker.html).
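
The arithmetic behind the 95% estimate can be checked directly from the two figures quoted in the abstract. The following is a minimal sketch; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract, as percent of the 779 kb sample.
substitution_divergence = 1.4  # divergence due to base substitution
indel_divergence = 3.4         # additional difference due to indels

# Total divergence is the sum of the two components.
total_divergence = round(substitution_divergence + indel_divergence, 1)

# The exactly shared fraction is whatever remains.
shared = round(100.0 - total_divergence, 1)

print(f"total divergence: {total_divergence}%")
print(f"exactly shared:   {shared}%")
```

The remainder, just over 95%, is the rounded value the abstract reports.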

266 citations


Journal ArticleDOI
TL;DR: The characterization of ∼3000 non-redundant cDNAs from a clonal line of the planarian Schmidtea mediterranea, and the ability of abrogating gene expression in planarians using RNA interference technology, pave the way for a systematic study of the remarkable biological properties displayed by Platyhelminthes.
Abstract: Platyhelminthes are excellent models for the study of stem cell biology, regeneration and the regulation of scale and proportion. In addition, parasitic forms infect millions of people worldwide. Therefore, it is puzzling that they remain relatively unexplored at the molecular level. We present the characterization of approximately 3,000 non-redundant cDNAs from a clonal line of the planarian Schmidtea mediterranea. The obtained cDNA sequences, homology comparisons and high-throughput whole-mount in situ hybridization data form part of the S. mediterranea database (SmedDb; http://planaria.neuro.utah.edu). Sixty-nine percent of the cDNAs analyzed share similarities with sequences deposited in GenBank and dbEST. The remaining gene transcripts failed to match sequences in other organisms, even though a large number of these (approximately 80%) contained putative open reading frames. Taken together, the molecular resources presented in this study, along with the ability of abrogating gene expression in planarians using RNA interference technology, pave the way for a systematic study of the remarkable biological properties displayed by Platyhelminthes.

263 citations


Journal ArticleDOI
TL;DR: It is concluded that thermophiles are a biologically and phylogenetically divergent group of prokaryotes that have converged to sustain extreme environmental conditions over evolutionary timescale.
Abstract: Thermoanaerobacter tengcongensis is a rod-shaped, gram-negative, anaerobic eubacterium that was isolated from a freshwater hot spring in Tengchong, China. Using a whole-genome-shotgun method, we sequenced its 2,689,445-bp genome from an isolate, MB4(T) (GenBank accession no. AE008691). The genome encodes 2588 predicted coding sequences (CDS). Among them, 1764 (68.2%) are classified according to homology to other documented proteins, and the rest, 824 CDS (31.8%), are functionally unknown. One of the interesting features of the T. tengcongensis genome is that 86.7% of its genes are encoded on the leading strand of DNA replication. Based on protein sequence similarity, the T. tengcongensis genome is most similar to that of Bacillus halodurans, a mesophilic eubacterium, among all prokaryotic genomes fully sequenced to date. Computational analysis of genes involved in basic metabolic pathways supports the experimental discovery that T. tengcongensis metabolizes sugars as its principal energy and carbon source and utilizes thiosulfate and elemental sulfur, but not sulfate, as electron acceptors. T. tengcongensis, as a gram-negative rod by empirical definitions (such as staining), shares many genes that are characteristic of gram-positive bacteria, whereas it is missing molecular components unique to gram-negative bacteria. A strong correlation between the G + C content of tDNA and rDNA genes and the optimal growth temperature is found among the sequenced thermophiles. It is concluded that thermophiles are a biologically and phylogenetically divergent group of prokaryotes that have converged to sustain extreme environmental conditions over evolutionary timescale.

236 citations


Journal ArticleDOI
TL;DR: The Medicago truncatula expressed sequence tag (EST) database (Gene Index) contains over 140,000 sequences from 30 cDNA libraries and offers the possibility of identifying previously uncharacterized genes and assessing the frequency and tissue specificity of their expression in silico.
Abstract: The Medicago truncatula expressed sequence tag (EST) database (Gene Index) contains over 140,000 sequences from 30 cDNA libraries. This resource offers the possibility of identifying previously uncharacterized genes and assessing the frequency and tissue specificity of their expression in silico. Because M. truncatula forms symbiotic root nodules, unlike Arabidopsis, this is a particularly important approach in investigating genes specific to nodule development and function in legumes. Our analyses have revealed 340 putative gene products, or tentative consensus sequences (TCs), expressed solely in root nodules. These TCs were represented by two to 379 ESTs. Of these TCs, 3% appear to encode novel proteins, 57% encode proteins with a weak similarity to the GenBank accessions, and 40% encode proteins with strong similarity to the known proteins. Nodule-specific TCs were grouped into nine categories based on the predicted function of their protein products. Besides previously characterized nodulins, other examples of highly abundant nodule-specific transcripts include plantacyanin, agglutinin, embryo-specific protein, and purine permease. Six nodule-specific TCs encode calmodulin-like proteins that possess a unique cleavable transit sequence potentially targeting the protein into the peribacteroid space. Surprisingly, 114 nodule-specific TCs encode small Cys cluster proteins with a cleavable transit peptide. To determine the validity of the in silico analysis, expression of 91 putative nodule-specific TCs was analyzed by macroarray and RNA-blot hybridizations. Nodule-enhanced expression was confirmed experimentally for the TCs composed of five or more ESTs, whereas the results for those TCs containing fewer ESTs were variable.

231 citations


Journal ArticleDOI
15 Sep 2002-Blood
TL;DR: A molecular resource of genes expressed in primary malignant plasma cells using a combination of cDNA library construction, 5' end single-pass sequencing, bioinformatics, and microarray analysis, which contains numerous genes of unknown function and may complement other commercially available arrays in defining the molecular portrait of this hematopoietic malignancy.

211 citations


Journal ArticleDOI
TL;DR: Hydroxyapatite chromatography was used to fractionate sorghum genomic DNA into highly repetitive, moderately repetitive, and single/low-copy sequence components that were consequently cloned to produce HRCot, MRCot, and SLCot genomic libraries.
Abstract: Cot-based sequence discovery represents a powerful means by which both low-copy and repetitive sequences can be selectively and efficiently fractionated, cloned, and characterized. Based upon the results of a Cot analysis, hydroxyapatite chromatography was used to fractionate sorghum (Sorghum bicolor) genomic DNA into highly repetitive (HR), moderately repetitive (MR), and single/low-copy (SL) sequence components that were consequently cloned to produce HRCot, MRCot, and SLCot genomic libraries. Filter hybridization (blotting) and sequence analysis both show that the HRCot library is enriched in sequences traditionally found in high-copy number (e.g., retroelements, rDNA, centromeric repeats), the SLCot library is enriched in low-copy sequences (e.g., genes and “nonrepetitive ESTs”), and the MRCot library contains sequences of moderate redundancy. The Cot analysis suggests that the sorghum genome is approximately 700 Mb (in agreement with previous estimates) and that HR, MR, and SL components comprise 15%, 41%, and 24% of sorghum DNA, respectively. Unlike previously described techniques to sequence the low-copy components of genomes, sequencing of Cot components is independent of expression and methylation patterns that vary widely among DNA elements, developmental stages, and taxa. High-throughput sequencing of Cot clones may be a means of “capturing” the sequence complexity of eukaryotic genomes at unprecedented efficiency. [Online supplementary material is available at www.genome.org. The sequence data described in this paper have been submitted to the GenBank under accession nos. AZ921847-AZ923007. Reagents, samples, and unpublished information freely provided by H. Ma and J. Messing.]

186 citations


Journal ArticleDOI
TL;DR: An expressed sequence tag library was constructed from hemocytes of the black tiger shrimp to identify genes associated with immunity in this economically important species and three full-length ESTs encoding antimicrobial peptides and a heat shock protein are reported.
Abstract: An expressed sequence tag (EST) library was constructed from hemocytes of the black tiger shrimp (Penaeus monodon) to identify genes associated with immunity in this economically important species. The number of complementary DNA clones in the constructed library was approximately 4 × 10^5. Of these, 615 clones having inserts larger than 500 bp were unidirectionally sequenced and analyzed by homology searches against data in GenBank. Significant homology to known genes was found in 314 (51%) of the 615 clones, but the remaining 301 sequences (49%) did not match any sequence in GenBank. Approximately 35% of the matched ESTs were significantly identified by the BLASTN and BLASTX programs, while 65% were recognized only by the BLASTX program. Of the 615 clones, 55 (8.9%) were identified as putative immune-related genes. The isolated genes were composed of those coding for enzymes and proteins in the clotting system and the prophenoloxidase-activating system, antioxidative enzymes, antimicrobial peptides, and serine proteinase inhibitors. Three full-length ESTs encoding antimicrobial peptides (antilipopolysaccharide and penaeidin homologues) and a heat shock protein (cpn10 homologue) are reported.

Journal ArticleDOI
TL;DR: All of the marine bacterioplankton-derived 16S ribosomal DNA sequences previously deposited in GenBank were reanalyzed to determine the number of bacterial species in the oceanic surface waters, concluding that the apparent bacterioplankton species richness is relatively low.
Abstract: All of the marine bacterioplankton-derived 16S ribosomal DNA sequences previously deposited in GenBank were reanalyzed to determine the number of bacterial species in the oceanic surface waters. These sequences have been entered into the database since 1990. The rate of new additions reached a peak in 1999 and subsequently leveled off, suggesting that much of the marine microbial species richness has been sampled. When the GenBank sequences were dereplicated by using 97% similarity as a cutoff, 1,117 unique ribotypes were found. Of the unique sequences, 609 came from uncultured environmental clones and 508 came from cultured bacteria. We conclude that the apparent bacterioplankton species richness is relatively low.
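
Dereplication at a similarity cutoff, as mentioned in the abstract, can be illustrated with a small greedy procedure. This is only a sketch under stated assumptions: the abstract does not describe the authors' actual algorithm or alignment method, the sequences here are assumed to be pre-aligned to equal length, and the function names `identity` and `dereplicate` are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not a:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def dereplicate(seqs, cutoff=0.97):
    """Greedy dereplication: keep a sequence only if its identity to every
    representative kept so far is below the cutoff."""
    representatives = []
    for seq in seqs:
        if all(identity(seq, rep) < cutoff for rep in representatives):
            representatives.append(seq)
    return representatives


# Toy 100-base sequences (not real 16S data).
a = "ACGT" * 25
b = "TT" + a[2:]         # 2 mismatches vs. a: 98% identical, merged with a
c = "T" * 10 + a[10:]    # 8 mismatches vs. a: 92% identical, kept as new
unique = dereplicate([a, b, c])
print(len(unique))
```

A greedy pass like this depends on input order; it is shown only to make the role of the 97% cutoff concrete.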

Journal ArticleDOI
TL;DR: HGVbase (Human Genome Variation database; http://hgvbase.cgb.ki.se) is an academic effort to provide a high quality and non-redundant database of available genomic variation data of all types, mostly comprising single nucleotide polymorphisms (SNPs).
Abstract: HGVbase (Human Genome Variation database; http://hgvbase.cgb.ki.se, formerly known as HGBASE) is an academic effort to provide a high quality and non-redundant database of available genomic variation data of all types, mostly comprising single nucleotide polymorphisms (SNPs). Records include neutral polymorphisms as well as disease-related mutations. Online search tools facilitate data interrogation by sequence similarity and keyword queries, and searching by genome coordinates is now being implemented. Downloads are freely available in XML, Fasta, SRS, SQL and tagged-text file formats. Each entry is presented in the context of its surrounding sequence and many records are related to neighboring human genes and affected features therein. Population allele frequencies are included wherever available. Thorough semi-automated data checking ensures internal consistency and addresses common errors in the source information. To keep pace with recent growth in the field, we have developed tools for fully automated annotation. All variants have been uniquely mapped to the draft genome sequence and are referenced to positions in EMBL/GenBank files. Data utility is enhanced by provision of genotyping assays and functional predictions. Recent data structure extensions allow the capture of haplotype and genotype information, and a new initiative (along with BiSC and HUGO-MDI) aims to create a central repository for the broad collection of clinical mutations and associated disease phenotypes of interest.

Journal ArticleDOI
TL;DR: Evidence that the P. tricornutum genome is likely to be small (less than 20 Mb), together with the analysis of approximately 1,000 expressed sequence tags, supports the development of this species as a model system for molecular-based studies of diatom biology.
Abstract: Diatoms are a ubiquitous class of microalgae of extreme importance for global primary productivity and for the biogeochemical cycling of minerals such as silica. However, very little is known about diatom cell biology or about their genome structure. For diatom researchers to take advantage of genomics and post-genomics technologies, it is necessary to establish a model diatom species. Phaeodactylum tricornutum is an obvious candidate because of its ease of culture and because it can be genetically transformed. Therefore, we have examined its genome composition by the generation of approximately 1,000 expressed sequence tags. Although more than 60% of the sequences could not be unequivocally identified by similarity to sequences in the databases, approximately 20% had high similarity with a range of genes defined functionally at the protein level. It is interesting that many of these sequences are more similar to animal rather than plant counterparts. Base composition at each codon position and GC content of the genome were compared with Arabidopsis, maize (Zea mays), and Chlamydomonas reinhardtii. It was found that distribution of GC within the coding sequences is as homogeneous in P. tricornutum as in Arabidopsis, but with a slightly higher GC content. Furthermore, we present evidence that the P. tricornutum genome is likely to be small (less than 20 Mb). Therefore, this combined information supports the development of this species as a model system for molecular-based studies of diatom biology. The nucleotide sequence data reported has been deposited in GenBank Nucleotide Sequence Database (dbEST section) under accession nos. BI306757 through BI307753.
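
The base-composition comparison at each codon position can be made concrete with a short calculation. This is a minimal sketch, assuming an in-frame coding sequence as input; the function name `gc_by_codon_position` is hypothetical and the abstract does not specify how the authors computed their values.

```python
def gc_by_codon_position(cds):
    """Return the GC fraction at codon positions 1, 2 and 3 of an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    fractions = []
    for offset in range(3):
        bases = cds[offset:usable:3]  # every third base, starting at this codon position
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)


# Toy example: codons ATG, GCC, GCG.
gc1, gc2, gc3 = gc_by_codon_position("ATGGCCGCG")
print(gc1, gc2, gc3)
```

Comparing these three fractions across species is one simple way to look at how GC content is distributed within coding sequences.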

Journal ArticleDOI
TL;DR: A new automated annotation system and database called Rice Genome Automated Annotation System (RiceGAAS) has been developed to execute a reliable and up-to-date analysis of the genome sequence as well as to store and retrieve the results of annotation.
Abstract: An extensive effort of the International Rice Genome Sequencing Project (IRGSP) has resulted in rapid accumulation of genome sequence, and >137 Mb has already been made available to the public domain as of August 2001. This requires a high-throughput annotation scheme to extract biologically useful and timely information from the sequence data on a regular basis. A new automated annotation system and database called Rice Genome Automated Annotation System (RiceGAAS) has been developed to execute a reliable and up-to-date analysis of the genome sequence as well as to store and retrieve the results of annotation. The system has the following functional features: (i) collection of rice genome sequences from GenBank; (ii) execution of gene prediction and homology search programs; (iii) integration of results from various analyses and automatic interpretation of coding regions; (iv) re-execution of analysis, integration and automatic interpretation with the latest entries in reference databases; (v) integrated visualization of the stored data using web-based graphical view. RiceGAAS also has a data submission mechanism that allows public users to perform fully automated annotation of their own sequences. The system can be accessed at http://RiceGAAS.dna.affrc.go.jp/.

Journal ArticleDOI
20 Jan 2002-Virology
TL;DR: The results from this study indicated that TFV may belong to the genus Ranavirus of the family Iridoviridae.

Journal ArticleDOI
TL;DR: Evaluation of the sequence relationships between these proteins contributes contextual information that enhances understanding of individual family members and provides tools and a framework for the further characterization of this important class of enzymes.
Abstract: Background: Eukaryotic protein kinases (EPKs) constitute one of the largest recognized protein families represented in the human genome. EPKs, which are similar to each other in sequence, structure and biochemical properties, are important players in virtually every signaling pathway involved in normal development and disease. Near completion of projects to sequence the human genome and transcriptome provide an opportunity to identify and perform sequence analysis on a nearly complete set of human EPKs. Results: Publicly available genetic sequence data were searched for human sequences that potentially represent EPK family members. After removal of duplicates, splice variants and pseudogenes, this search yielded 510 sequences with recognizable similarity to the EPK family. Protein sequences of putative EPK catalytic domains identified in the search were aligned, and a phenogram was constructed based on the alignment. Representative sequence records in GenBank were identified, and derived information about gene mapping and nomenclature was summarized. Conclusions: This work represents a nearly comprehensive census and early bioinformatics overview of the EPKs encoded in the human genome. Evaluation of the sequence relationships between these proteins contributes contextual information that enhances understanding of individual family members. This curation of human EPK sequences provides tools and a framework for the further characterization of this important class of enzymes.

Journal ArticleDOI
TL;DR: The sequence information and SSR loci generated through this study will be valuable for application to sorghum genetics and improvement, including gene discovery, marker-assisted selection, diversity and pedigree analyses, comparative mapping and evolutionary genetic studies.
Abstract: In this study, we collected and analyzed DNA sequence data for 789 previously mapped RFLP probes from Sorghum bicolor (L.) Moench. DNA sequences, comprising 894 non-redundant contigs and end sequences, were searched against three GenBank databases, nucleotide (nt), protein (nr) and EST (dbEST), using BLAST algorithms. Matching ESTs were also searched against nt and nr. Translated DNA sequences were then searched against the conserved domain database (CDD) to determine if functional domains/motifs were congruent with the proteins identified in previous searches. More than half (500/894 or 56%) of the query sequences had significant matches in at least one of the GenBank searches. Overall, proteins identified for 148 sequences (17%) were consistent among all searches, of which 66 sequences (7%) contained congruent coding domains. The RFLP probe sequences were also evaluated for the presence of simple sequence repeats (SSRs) and 60 SSRs were developed and assayed in an array of sorghum germplasm comprising inbreds, landraces and wild relatives. Overall, these SSR loci had lower levels of polymorphism (D = 0.46, averaged over 51 polymorphic loci) compared with sorghum SSRs that were isolated by library hybridization screens (D = 0.69, averaged over 38 polymorphic loci). This result was probably due to the relatively small proportion of di-nucleotide repeat-containing markers (42% of the total SSR loci) obtained from the DNA sequence data. These di-nucleotide markers also contained shorter repeat motifs than those isolated from genomic libraries. Based on BLAST results, 24 SSRs (40%) were located within, or near, previously annotated or hypothetical genes. We determined the location of 19 of these SSRs relative to putative coding regions. In general, SSRs located in coding regions were less polymorphic (D = 0.07, averaged over three loci) than those from gene flanking regions, UTRs and introns (D = 0.49, averaged over 16 loci). 
The sequence information and SSR loci generated through this study will be valuable for application to sorghum genetics and improvement, including gene discovery, marker-assisted selection, diversity and pedigree analyses, comparative mapping and evolutionary genetic studies.

Journal ArticleDOI
TL;DR: The development of GO Engine, a computational platform for GO annotation, and analysis of the resultant GO annotations of human proteins are reported, which centered on sequence homology with GO-annotated proteins and protein domain analysis.
Abstract: Recent progress in genomic sequencing, computational biology, and ontology development has presented an opportunity to investigate biological systems from a unique perspective, that is, examining genomes and transcriptomes through the multiple and hierarchical structure of Gene Ontology (GO). We report here our development of GO Engine, a computational platform for GO annotation, and analysis of the resultant GO annotations of human proteins. Protein annotation was centered on sequence homology with GO-annotated proteins and protein domain analysis. Text information analysis and a multiparameter cellular localization predictive tool were also used to increase the annotation accuracy, and to predict novel annotations. The majority of proteins corresponding to full-length mRNA in GenBank, and the majority of proteins in the NR database (nonredundant database of proteins) were annotated with one or more GO nodes in each of the three GO categories. The annotations of GenBank and SWISS-PROT proteins are available to the public at the GO Consortium web site.

Journal Article
TL;DR: The lens cDNA libraries are a resource for gene discovery, full length cDNAs for functional studies and microarrays, and the discovery of an abundant, novel transcript, lengsin, and a major novel splice form of MP19 reflect the utility of unamplified libraries constructed from dissected tissue.
Abstract: Purpose: To explore the expression profile of the human lens and to provide a resource for microarray studies, expressed sequence tag (EST) analysis has been performed on cDNA libraries from adult lenses. Methods: A cDNA library was constructed from two adult (40 year old) human lenses. Over two thousand clones were sequenced from the unamplified, un-normalized library. The library was then normalized and a further 2200 sequences were obtained. All the data were analyzed using GRIST (GRouping and Identification of Sequence Tags), a procedure for gene identification and clustering. Results: The lens library (by) contains a low percentage of non-mRNA contaminants and a high fraction (over 75%) of apparently full length cDNA clones. Approximately 2000 reads from the unamplified library yields 810 clusters, potentially representing individual genes expressed in the lens. After normalization, the content of crystallins and other abundant cDNAs is markedly reduced and a similar number of reads from this library (fs) yields 1455 unique groups of which only two thirds correspond to named genes in GenBank. Among the most abundant cDNAs is one for a novel gene related to glutamine synthetase, which was designated "lengsin" (LGS). Analyses of ESTs also reveal examples of alternative transcripts, including a major alternative splice form for the lens specific membrane protein MP19. Variant forms for other transcripts, including those encoding the apoptosis inhibitor Livin and the armadillo repeat protein ARVCF, are also described. Conclusions: The lens cDNA libraries are a resource for gene discovery, full length cDNAs for functional studies and microarrays. The discovery of an abundant, novel transcript, lengsin, and a major novel splice form of MP19 reflect the utility of unamplified libraries constructed from dissected tissue. Many novel transcripts and splice forms are represented, some of which may be candidates for genetic diseases.

Journal ArticleDOI
TL;DR: A comprehensive survey of the Pseudoviridae (Ty1/copia) retroelement family was conducted using the GenBank sequence database and completed genome sequences of several model organisms, and it is proposed that this monophyletic lineage defines a new Pseudoviridae genus, herein referred to as the AGROVIRUS.
Abstract: A comprehensive survey of the Pseudoviridae (Ty1/copia) retroelement family was conducted using the GenBank sequence database and completed genome sequences of several model organisms. Plant genomes were the most abundant sources of Pseudoviridae, with the Arabidopsis thaliana genome having 276 distinct elements. A reverse transcriptase amino acid sequence phylogeny indicated that the Pseudoviridae comprises highly divergent members. Coding sequences for a representative subset of elements were analyzed to identify conserved domains and differences that may underlie functional divergence. With the exception of some fungal elements (e.g., Ty1), most Pseudoviridae encode Gag and Pol on a single open reading frame. In addition to the nearly ubiquitous RNA-binding motif of nucleocapsid, three new conserved domains were identified in Gag. The pol-encoded aspartic protease was similar to the retroviral enzyme and could be mapped onto the HIV-1 structure. Pol was highly conserved throughout the family. The greatest divergence among Pol sequences was seen in the C-terminus of integrase (IN). We defined a large motif (GKGY) after the IN catalytic domain that is unique to the Pseudoviridae. Additionally, the extreme C-terminus of IN is rich in simple sequence motifs. A distinct lineage of Pseudoviridae in plants has env-like genes. This lineage has undergone a large expansion of Gag characterized by an alpha-helix-rich domain containing coiled-coil motifs. In several elements, this domain is flanked on both sides by RNA-binding domains. We propose that this monophyletic lineage defines a new Pseudoviridae genus, herein referred to as the AGROVIRUS.

Journal ArticleDOI
TL;DR: It was shown that the translation products of ORFs 0 and 3 are significantly more conserved than those of the overlapping ORFs 1 and 4, respectively, which allow the proposal of new hypotheses to explain the apparent genetic stability of PLRV and its evolution.
Abstract: In order to investigate the genetic diversity of Potato leafroll virus (PLRV), seven new complete genomic sequences of isolates collected worldwide were compared with the five sequences available in GenBank. Then, a restricted polymorphic region of the genome was chosen to further analyse new sequences. The sequences of PLRV open reading frames (ORFs) 3 and 4 were also compared with those of two other poleroviruses and the non-synonymous to synonymous substitution ratio distribution was analysed in overlapping and non-overlapping regions of the genome using maximum-likelihood models. Results confirmed that PLRV sequences from around the world are very closely related and showed that the region encoding protein P0 allowed the detection of three groups of isolates. When compared to other poleroviruses, PLRV was the most conserved in both ORFs 3 and 4. However, the results suggest that important events, such as deletion, mutation at a stop codon and intraspecific homologous recombination events, have occurred during the evolution of PLRV. Finally, it was shown that the translation products of ORFs 0 and 3 are significantly more conserved than those of the overlapping ORFs 1 and 4, respectively. All together, the results allow the proposal of new hypotheses to explain the apparent genetic stability of PLRV and its evolution.

Journal ArticleDOI
TL;DR: It is found that only trinucleotide repeats show repeat enrichment well above the threshold of statistical significance, and within the repeat regions slippage is more frequent than point mutations, whereas the ratio of silent versus recognizable point mutations is approximately the same as elsewhere in coding regions.
Abstract: Tandem repeats in GenBank primate nucleotide sequences annotated as protein coding regions are analyzed. It is found that only trinucleotide repeats show repeat enrichment well above the threshold of statistical significance. The statistics are improved by a simultaneous search for repeats on both the amino acid and nucleotide levels. The results of the analyses of natural sequences are interpreted by comparing them with the results of the computer simulation of the model dedicated to protein coding regions. According to the simulation results, a limited set of trinucleotides, that is, cgg, ccg, cag, and gaa repeats coding for polyalanine, polyglycine, polyproline, polyglutamine, and polylysine are prone to proliferation. It is also found that within the repeat regions slippage is more frequent by a factor of 10 than point mutations, whereas the ratio of silent versus recognizable point mutations is approximately the same as elsewhere in coding regions. The trinucleotide repeats cover slightly more than 0.3% of the protein coding regions of genes.
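
The kind of in-frame repeat search described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the function name `codon_repeat_runs`, the `min_units` threshold, and the restriction to reading frame 0 are all assumptions made for illustration.

```python
def codon_repeat_runs(cds, min_units=4):
    """Find runs of one codon repeated in tandem within a coding sequence.

    Hypothetical helper: reads `cds` in frame 0 and returns a list of
    (start_position, codon, repeat_count) for every run of at least
    `min_units` identical consecutive codons.
    """
    cds = cds.lower()
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    runs = []
    i = 0
    while i < len(codons):
        # Extend j to the last codon of the run that starts at i.
        j = i
        while j + 1 < len(codons) and codons[j + 1] == codons[i]:
            j += 1
        count = j - i + 1
        if count >= min_units:
            runs.append((i * 3, codons[i], count))
        i = j + 1
    return runs


# Toy sequence: start codon, five cag codons (polyglutamine), stop codon.
print(codon_repeat_runs("atg" + "cag" * 5 + "taa"))
```

Working codon by codon keeps each reported repeat unit aligned with the reading frame, so a run of cag is not reported as a shifted run of gca or agc.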

Journal ArticleDOI
TL;DR: Cluster analysis revealed that within-library redundancy is low, and comparison of all porcine ESTs with the human database suggests that the sequences from these two libraries represent portions of a significant number of independent pig genes.
Abstract: Genetic and environmental factors affect the efficiency of pork production by influencing gene expression during porcine reproduction, tissue development, and growth. The identification and functional analysis of gene products important to these processes would be greatly enhanced by the development of a database of expressed porcine gene sequence. Two normalized porcine cDNA libraries (MARC 1PIG and MARC 2PIG), derived respectively from embryonic and reproductive tissues, were constructed, sequenced, and analyzed. A total of 66,245 clones from these two libraries were 5′-end sequenced and deposited in GenBank. Cluster analysis revealed that within-library redundancy is low, and comparison of all porcine ESTs with the human database suggests that the sequences from these two libraries represent portions of a significant number of independent pig genes. A Porcine Gene Index (PGI), comprising 15,616 tentative consensus sequences and 31,466 singletons, includes all sequences in public repositories and has been developed to facilitate further comparative map development and characterization of porcine genes (http://www.tigr.org/tdb/ssgi/). The clones and sequences from these libraries provide a catalog of expressed porcine genes and a resource for development of high-density hybridization arrays for transcriptional profiling of porcine tissues. In addition, comparison of porcine ESTs with sequences from other species serves as a valuable resource for comparative map development. Both arrayed cDNA libraries are available for unrestricted public use.

Journal ArticleDOI
TL;DR: To examine the applicability of gpd sequences in resolving relationships within Ascomycota, trees were calculated from 22 fungal gpd sequences obtained from GenBank together with the two Cladonia sequences using parsimony jackknifing and produced a tree very similar to that of the SSU rDNA data.
Abstract: Primers for amplification and sequencing of partial glyceraldehyde-3-phosphate dehydrogenase (gpd) gene were designed for lichenized fungi. The 5' gpd primer is most probably fungal specific, since a BLAST search in GenBank found identical sequences only from ascomycetous taxa, whereas the 3' gpd primer was more universal. Utility of the gpd primers and previously designed beta-tubulin primers was tested in nine lichen taxa. Both the gpd and beta-tubulin primer pairs amplified in most of the taxa examined: the gpd primers generated a c. 1100 nucleotide fragment, whereas the PCR product obtained from the beta-tubulin primers was c. 900 nucleotides long. The gpd amplification products of Cladonia arbuscula and C. rangiferina were sequenced and both were found to contain three introns, the length of which varied between 49 and 83 nucleotides. To examine the applicability of gpd sequences in resolving relationships within Ascomycota, trees were calculated from 22 fungal gpd sequences obtained from GenBank together with the two Cladonia sequences using parsimony jackknifing. The gpd tree was compared with the SSU rDNA tree of the respective species (or genera). A similar analysis of the beta-tubulin gene was not performed, because only a few beta-tubulin sequences from the same taxa were available in GenBank. The gpd tree was well resolved but in conflict with the SSU rDNA tree. In contrast to the SSU rDNA tree, the gpd tree did not support the monophyly of the Ascomycota. Analysis of the combined data set produced a tree very similar to that of the SSU rDNA data. However, the relationship of Lecanorales to the other orders remained unresolved. Even though gpd and beta-tubulin are highly conserved proteins, the third codon positions and introns are variable and both genes have the potential for inferring phylogenetic relationships at the lower taxonomic levels in the lichenized fungi. The two genes may be useful even below species level, depending on the species investigated.

Journal ArticleDOI
01 Dec 2002-Genome
TL;DR: Homology modeling of structures using Torpedo californica and Drosophila melanogaster native acetylcholinesterase structure as main template indicated that the two cloned AChEs of Aphis gossypii might have different three-dimensional structures.
Abstract: Two acetylcholinesterase (AChE) genes, Ace1 and Ace2, have been cloned from cotton aphid, Aphis gossypii Glover, using the rapid amplification of cDNA ends (RACE) technique. To the best of our knowledge, this should be the first direct molecular evidence that multiple AChE genes exist in insects. The Ace1 gene was successfully amplified along its full length of 2371 bp. The open reading frame is 2031 bp long and encodes 676 amino acids (GenBank accession No. AF502082). The Ace2 gene was amplified as a mega-fragment of 2130 bp lacking part of 5'-end untranslated region (UTR). The open reading frame is 1992 bp long and encodes a protein of 664 amino acids (GenBank accession No. AF502081). Both genes have the conserved amino acids and features shared by the AChE family, but share only 35% identity in amino acid sequence. The Ace1 gene is highly homologous to the AChE gene of Schizaphis graminum (AF321574) with 95% identity, and Ace2 to that of Myzus persicae (AF287291) with 92% identity. Phylogenetic analysis...

Journal ArticleDOI
TL;DR: Southern blot analyses showed that some of the RQ2 differential sequences are found in some other members of the order Thermotogales, but the distribution of these variable genes is patchy, suggesting frequent lateral gene transfer within the group.
Abstract: Comparisons between genomes of closely related bacteria often show large variations in gene content, even between strains of the same species. Such studies have focused mainly on pathogens; here, we examined Thermotoga maritima, a free-living hyperthermophilic bacterium, by using suppressive subtractive hybridization. The genome sequence of T. maritima MSB8 is available, and DNA from this strain served as a reference to obtain strain-specific sequences from Thermotoga sp. strain RQ2, a very close relative (∼96% identity for orthologous protein-coding genes, 99.7% identity in the small-subunit rRNA sequence). Four hundred twenty-six RQ2 subtractive clones were sequenced. One hundred sixty-six had no DNA match in the MSB8 genome. These differential clones comprise, in sum, 48 kb of RQ2-specific DNA and match 72 genes in the GenBank database. From the number of identical clones, we estimated that RQ2 contains 350 to 400 genes not found in MSB8. Assuming a similar genome size, this corresponds to 20% of the RQ2 genome. A large proportion of the RQ2-specific genes were predicted to be involved in sugar transport and polysaccharide degradation, suggesting that polysaccharides are more important as nutrients for this strain than for MSB8. Several clones encode proteins involved in the production of surface polysaccharides. RQ2 encodes multiple subunits of a V-type ATPase, while MSB8 possesses only an F-type ATPase. Moreover, an RQ2-specific MutS homolog was found among the subtractive clones and appears to belong to a third novel archaeal type MutS lineage. Southern blot analyses showed that some of the RQ2 differential sequences are found in some other members of the order Thermotogales, but the distribution of these variable genes is patchy, suggesting frequent lateral gene transfer within the group.
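
The abstract's estimate that 350 to 400 strain-specific genes correspond to 20% of the RQ2 genome implies a total of roughly 1,750 to 2,000 genes. The short Python check below makes that arithmetic explicit. The function name is hypothetical, and treating the share of genes as the share of the genome simply follows the abstract's own simplification.

```python
def implied_total_genes(strain_specific_genes, percent_of_genome):
    """Total gene count implied if the strain-specific genes make up
    `percent_of_genome` percent of all genes (illustrative helper)."""
    return strain_specific_genes * 100 / percent_of_genome


# Lower and upper estimates of RQ2-specific genes quoted in the abstract.
for n in (350, 400):
    print(n, "strain-specific genes imply", implied_total_genes(n, 20), "genes in total")
```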

Journal ArticleDOI
TL;DR: A review of the P2Y receptor subtypes is presented considering both their sequences and the pharmacological profiles of the encoded receptors expressed in heterologous expression systems.

Journal ArticleDOI
TL;DR: The Histone Database (http://genome.nhgri.nih.gov/histones/) is a searchable, periodically updated collection of histone fold-containing sequences derived from sequence-similarity searches of public databases.
Abstract: Histone proteins are often noted for their high degree of sequence conservation. It is less often recognized that the histones are a heterogeneous protein family. Furthermore, several classes of non-histone proteins containing the histone fold motif exist. Novel histone and histone fold protein sequences continue to be added to public databases every year. The Histone Database (http://genome.nhgri.nih.gov/histones/) is a searchable, periodically updated collection of histone fold-containing sequences derived from sequence-similarity searches of public databases. Sequence sets are presented in redundant and non-redundant FASTA form, hotlinked to GenBank sequence files. Partial sequences are also now included in the database, which has considerably augmented its taxonomic coverage. Annotated alignments of full-length non-redundant sets of sequences are now available in both web-viewable (HTML) and downloadable (PDF) formats. The database also provides summaries of current information on solved histone fold structures, post-translational modifications of histones, and the human histone gene complement.

Journal ArticleDOI
30 Oct 2002-Gene
TL;DR: It is confirmed that introns have a tendency to be located toward the 5' end of the gene, and the distance from the start codon to the position of the intron is measured, finding that single introns prefer the location immediately after the start codon in S. cerevisiae and P. falciparum.