
Showing papers by "J. Craig Venter Institute" published in 2003


Journal Article
24 Apr 2003-Nature
TL;DR: A high-quality draft sequence of the N. crassa genome is reported, suggesting that RIP has had a profound impact on genome evolution, greatly slowing the creation of new genes through genomic duplication and resulting in a genome with an unusually low proportion of closely related genes.
Abstract: Neurospora crassa is a central organism in the history of twentieth-century genetics, biochemistry and molecular biology. Here, we report a high-quality draft sequence of the N. crassa genome. The approximately 40-megabase genome encodes about 10,000 protein-coding genes, more than twice as many as in the fission yeast Schizosaccharomyces pombe and only about 25% fewer than in the fruitfly Drosophila melanogaster. Analysis of the gene set yields insights into unexpected aspects of Neurospora biology including the identification of genes potentially associated with red light photobiology, genes implicated in secondary metabolism, and important differences in Ca2+ signalling as compared with plants and animals. Neurospora possesses the widest array of genome defence mechanisms known for any eukaryotic organism, including a process unique to fungi called repeat-induced point mutation (RIP). Genome analysis suggests that RIP has had a profound impact on genome evolution, greatly slowing the creation of new genes through genomic duplication and resulting in a genome with an unusually low proportion of closely related genes.

1,659 citations


Journal Article
TL;DR: The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.
Abstract: The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the ~27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.

1,441 citations


Journal Article
TL;DR: The complete genome sequence of the model bacterial pathogen Pseudomonas syringae pathovar tomato DC3000 (DC3000), which is pathogenic on tomato and Arabidopsis thaliana, is reported and 1,159 genes unique to DC3000 are revealed, of which 811 lack a known function.
Abstract: We report the complete genome sequence of the model bacterial pathogen Pseudomonas syringae pathovar tomato DC3000 (DC3000), which is pathogenic on tomato and Arabidopsis thaliana. The DC3000 genome (6.5 megabases) contains a circular chromosome and two plasmids, which collectively encode 5,763 ORFs. We identified 298 established and putative virulence genes, including several clusters of genes encoding 31 confirmed and 19 predicted type III secretion system effector proteins. Many of the virulence genes were members of paralogous families and also were proximal to mobile elements, which collectively comprise 7% of the DC3000 genome. The bacterium possesses a large repertoire of transporters for the acquisition of nutrients, particularly sugars, as well as genes implicated in attachment to plant surfaces. Over 12% of the genes are dedicated to regulation, which may reflect the need for rapid adaptation to the diverse environments encountered during epiphytic growth and pathogenesis. Comparative analyses confirmed a high degree of similarity with two sequenced pseudomonads, Pseudomonas putida and Pseudomonas aeruginosa, yet revealed 1,159 genes unique to DC3000, of which 811 lack a known function.

835 citations


Journal Article
01 May 2003-Nature
TL;DR: Several chromosomally encoded proteins that may contribute to pathogenicity—including haemolysins, phospholipases and iron acquisition functions—and numerous surface proteins that might be important targets for vaccines and drugs are found.
Abstract: Bacillus anthracis is an endospore-forming bacterium that causes inhalational anthrax. Key virulence genes are found on plasmids (extra-chromosomal, circular, double-stranded DNA molecules) pXO1 (ref. 2) and pXO2 (ref. 3). To identify additional genes that might contribute to virulence, we analysed the complete sequence of the chromosome of B. anthracis Ames (about 5.23 megabases). We found several chromosomally encoded proteins that may contribute to pathogenicity--including haemolysins, phospholipases and iron acquisition functions--and identified numerous surface proteins that might be important targets for vaccines and drugs. Almost all these putative chromosomal virulence and surface proteins have homologues in Bacillus cereus, highlighting the similarity of B. anthracis to near-neighbours that are not associated with anthrax. By performing a comparative genome hybridization of 19 B. cereus and Bacillus thuringiensis strains against a B. anthracis DNA microarray, we confirmed the general similarity of chromosomal genes among this group of close relatives. However, we found that the gene sequences of pXO1 and pXO2 were more variable between strains, suggesting plasmid mobility in the group. The complete sequence of B. anthracis is a step towards a better understanding of anthrax pathogenesis.

813 citations


Journal Article
TL;DR: The cloning of the major resistance gene RB in S. bulbocastanum is reported by using a map-based approach in combination with a long-range (LR)-PCR strategy, demonstrating that LR-PCR is a valuable approach to isolate genes that cannot be maintained in the bacterial artificial chromosome system.
Abstract: Late blight, caused by the oomycete pathogen Phytophthora infestans, is the most devastating potato disease in the world. Control of late blight in the United States and other developed countries relies extensively on fungicide application. We previously demonstrated that the wild diploid potato species Solanum bulbocastanum is highly resistant to all known races of P. infestans. Potato germplasm derived from S. bulbocastanum has shown durable and effective resistance in the field. Here we report the cloning of the major resistance gene RB in S. bulbocastanum by using a map-based approach in combination with a long-range (LR)-PCR strategy. A cluster of four resistance genes of the CC-NBS-LRR (coiled coil-nucleotide binding site-Leu-rich repeat) class was found within the genetically mapped RB region. Transgenic plants containing a LR-PCR product of one of these four genes displayed broad spectrum late blight resistance. The cloned RB gene provides a new resource for developing late blight-resistant potato varieties. Our results also demonstrate that LR-PCR is a valuable approach to isolate genes that cannot be maintained in the bacterial artificial chromosome system.

589 citations


Journal Article
TL;DR: The genome of Coxiella burnetii, Nine Mile phase I RSA493, a highly virulent zoonotic pathogen and category B bioterrorism agent, was sequenced by the random shotgun method; the analyses suggest that the obligate intracellular lifestyle of C. burnetii may be a relatively recent innovation.
Abstract: The 1,995,275-bp genome of Coxiella burnetii, Nine Mile phase I RSA493, a highly virulent zoonotic pathogen and category B bioterrorism agent, was sequenced by the random shotgun method. This bacterium is an obligate intracellular acidophile that is highly adapted for life within the eukaryotic phagolysosome. Genome analysis revealed many genes with potential roles in adhesion, invasion, intracellular trafficking, host-cell modulation, and detoxification. A previously uncharacterized 13-member family of ankyrin repeat-containing proteins is implicated in the pathogenesis of this organism. Although the lifestyle and parasitic strategies of C. burnetii resemble that of Rickettsiae and Chlamydiae, their genome architectures differ considerably in terms of presence of mobile elements, extent of genome reduction, metabolic capabilities, and transporter profiles. The presence of 83 pseudogenes displays an ongoing process of gene degradation. Unlike other obligate intracellular bacteria, 32 insertion sequences are found dispersed in the chromosome, indicating some plasticity in the C. burnetii genome. These analyses suggest that the obligate intracellular lifestyle of C. burnetii may be a relatively recent innovation.

516 citations


Journal Article
TL;DR: The genome of C.caviae was determined, representing the fourth species with a complete genome sequence from the Chlamydiaceae family of obligate intracellular bacterial pathogens, and enabling dissection of the roles played by niche-specific genes in these important bacterial pathogens.
Abstract: The genome of Chlamydophila caviae (formerly Chlamydia psittaci, GPIC isolate) (1 173 390 nt with a plasmid of 7966 nt) was determined, representing the fourth species with a complete genome sequence from the Chlamydiaceae family of obligate intracellular bacterial pathogens. Of 1009 annotated genes, 798 were conserved in all three other completed Chlamydiaceae genomes. The C.caviae genome contains 68 genes that lack orthologs in any other completed chlamydial genomes, including tryptophan and thiamine biosynthesis determinants and a ribose-phosphate pyrophosphokinase, the product of the prsA gene. Notable amongst these was a novel member of the virulence-associated invasin/intimin family (IIF) of Gram-negative bacteria. Intriguingly, two authentic frameshift mutations in the ORF indicate that this gene is not functional. Many of the unique genes are found in the replication termination region (RTR or plasticity zone), an area of frequent symmetrical inversion events around the replication terminus shown to be a hotspot for genome variation in previous genome sequencing studies. In C.caviae, the RTR includes several loci of particular interest including a large toxin gene and evidence of ancestral insertion(s) of a bacteriophage. This toxin gene, not present in Chlamydia pneumoniae, is a member of the YopT effector family of type III-secreted cysteine proteases. One gene cluster (guaBA-add) in the RTR is much more similar to orthologs in Chlamydia muridarum than those in the phylogenetically closest species C.pneumoniae, suggesting the possibility of horizontal transfer of genes between the rodent-associated Chlamydiae. With most genes observed in the other chlamydial genomes represented, C.caviae provides a good model for the Chlamydiaceae and a point of comparison against the human atherosclerosis-associated C.pneumoniae. 
This crucial addition to the set of completed Chlamydiaceae genome sequences is enabling dissection of the roles played by niche-specific genes in these important bacterial pathogens.

305 citations


Journal Article
Yeisoo Yu, Teri Rambo, Jennifer Currie, Christopher A. Saski, H. R. Kim, Kristi Collura, S. Thompson, J. Simmons, Tae-Jin Yang, G. Nah, Ami Patel, S. Thurmond, David C Henry, R. Oates, Michael Palmer, Gina L Pries, J. Gibson, H. Anderson, M. Paradkar, L. Crane, J. Dale, M. B. Carver, Todd C. Wood, David Frisch, F. Engler, Carol Soderlund, Lance E. Palmer, L. Tetylman, Lidia Nascimento, M. de la Bastide, Lori Spiegel, Doreen Ware, A. O'Shaughnessy, Sujit Dike, Neilay Dedhia, R. Preston, E. Huang, K. Ferraro, K. Kuit, B. Miller, Theresa Zutavern, F. Katzenberger, Stephanie Muller, Vivekanand Balija, Robert A. Martienssen, Lincoln Stein, Patrick Minx, David W. Johnson, Holly Cordum, Elaine R. Mardis, Zhukuan Cheng, Jiming Jiang, Richard K. Wilson, W. R. McCombie, Rod A. Wing, Q. Yuan, S. Ouyang, Jaime Liu, K. M. Jones, K. Gansberger, K. Moffat, J. Hill, T. Tsitrin, L. Overton, J. Bera, M. Kim, Sheng Chih Jin, L. Tallon, A. Ciecko, G. Pai, S. van Aken, T. Utterback, S. Reidmuller, J. Bormann, T. Feldblyum, J. Hsiao, V. Zismann, S. Blunt, A. de Vazeilles, T. Shaffer, H. Koo, B. Suh, Q. Yang, B. Haas, J. Peterson, Mihaela Pertea, N. Volfovsky, J. Wortman, O. White, Steven L. Salzberg, Claire M. Fraser, C. Robin Buell, Joachim Messing, Rentao Song, Galina Fuks, Victor Llaca, S. Kovchak, S. Young, John E. Bowers, Andrew H. Paterson, M. A. Johns, L. Mao, H. Pan, Ralph A. Dean
06 Jun 2003-Science
TL;DR: The sequence of chromosome 10, the smallest of the 12 rice chromosomes (22.4 megabases), is reported; it contains 3471 genes, and multiple insertions from organellar genomes were detected.
Abstract: Rice is the world's most important food crop and a model for cereal research. At 430 megabases in size, its genome is the most compact of the cereals. We report the sequence of chromosome 10, the smallest of the 12 rice chromosomes (22.4 megabases), which contains 3471 genes. Chromosome 10 contains considerable heterochromatin with an enrichment of repetitive elements on 10S and an enrichment of expressed genes on 10L. Multiple insertions from organellar genomes were detected. Collinearity was apparent between rice chromosome 10 and sorghum and maize. Comparison between the draft and finished sequence demonstrates the importance of finished sequence.

243 citations


Journal Article
TL;DR: ToxoDB was designed to provide a central point of access for all available T. gondii data, and a variety of data mining tools useful for the analysis of unfinished, un-annotated draft sequence during the early phases of the genome project.
Abstract: ToxoDB (http://ToxoDB.org) provides a genome resource for the protozoan parasite Toxoplasma gondii. Several sequencing projects devoted to T. gondii have been completed or are in progress: an EST project (http://genome.wustl.edu/est/index.php?toxoplasma=1), a BAC clone end-sequencing project (http://www.sanger.ac.uk/Projects/T_gondii/) and an 8X random shotgun genomic sequencing project (http://www.tigr.org/tdb/e2k1/tga1/). ToxoDB was designed to provide a central point of access for all available T. gondii data, and a variety of data mining tools useful for the analysis of unfinished, un-annotated draft sequence during the early phases of the genome project. In later stages, as more and different types of data become available (microarray, proteomic, SNP, QTL, etc.) the database will provide an integrated data analysis platform facilitating user-defined queries across the different data types.

192 citations


Journal Article
TL;DR: The timing of the Arabidopsis thaliana whole-genome duplication is estimated by means of phylogenetic and statistical analysis, which suggests that the duplication is younger than 38 million years and may have contributed to the Arabidopsis-Brassica divergence.
Abstract: We estimate the timing of the Arabidopsis thaliana whole-genome duplication by means of phylogenetic and statistical analysis, and propose two possible scenarios for the duplication. The first one, based on the assumption that the duplicated segments diverged from an autotetraploid form, places the duplication at about 38 million years ago, after the Arabidopsis lineage diverged from that of soybean (Glycine max) and before it diverged from its sister genus, Brassica. The second scenario assumes that the ancestor was allotetraploid, and suggests that the duplication is younger than 38 million years and may have contributed to the Arabidopsis-Brassica divergence. In each case, our estimate places the age of the genome duplication as significantly younger than previously reported.

100 citations


Journal Article
TL;DR: Analysis of an ≈100-kb contiguous chromosome segment from five isolates shows that P. vivax has a highly diverse genome, and useful information is provided for further understanding the genome diversity of the parasite.
Abstract: The study of genetic variation in malaria parasites has practical significance for developing strategies to control the disease. Vaccines based on highly polymorphic antigens may be confounded by allelic restriction of the host immune response. In response to drug pressure, a highly plastic genome may generate resistant mutants more easily than a monomorphic one. Additionally, the study of the distribution of genomic polymorphisms may provide information leading to the identification of genes associated with traits such as parasite development and drug resistance. Indeed, the age and diversity of the human malaria parasite Plasmodium falciparum has been the subject of recent debate, because an ancient parasite with a complex genome is expected to present greater challenges for drug and vaccine development. The genome diversity of the important human pathogen Plasmodium vivax, however, remains essentially unknown. Here we analyze an ≈100-kb contiguous chromosome segment from five isolates, revealing 191 single-nucleotide polymorphisms (SNPs) and 44 size polymorphisms. The SNPs are not evenly distributed across the segment with blocks of high and low diversity. Whereas the majority (≈63%) of the SNPs are in intergenic regions, introns contain significantly less SNPs than intergenic sequences. Polymorphic tandem repeats are abundant and are more uniformly distributed at a frequency of about one polymorphic tandem repeat per 3 kb. These data show that P. vivax has a highly diverse genome, and provide useful information for further understanding the genome diversity of the parasite.

Journal Article
TL;DR: It is shown that mutant forms of APC can induce repression of select terminal caspases as a potential means of attenuating responses to apoptotic stimuli, and the data support the hypothesis that one of the functions of APC is the regulation of caspase activity and other apoptotic proteins by controlling their expression levels in the cell.
Abstract: The adenomatous polyposis coli (APC) gene, a member of the WNT pathway, has been shown to assign intestinal epithelial cells to a program of proliferation or differentiation through regulation of the β-catenin/TCF-4 complex. Wild-type APC, in certain cellular contexts, appears to induce differentiation and apoptosis, although mutant forms of APC, known to produce polyps and ultimately cancers, may suppress these events. Here, we show that mutant forms of APC can induce repression of select terminal caspases as a potential means of attenuating responses to apoptotic stimuli. Using gene expression profiling to interrogate the intact intestines of Apc+/min mice harboring numerous polyps, we identified a reduction in the mRNA expression of both caspases 3 and 7. We additionally identified a reduction in protein levels of caspase-3, caspase-7, and caspase-9 in human colon cancer specimens known to harbor APC mutations. A reduction in caspase protein levels resulted in resistance to apoptotic-inducing agents and restoration of caspase levels reinstated apoptotic capacities. Consistent with Wnt pathway involvement, dominant negative TCF/LEF induced caspase protein expression. These data provide support for the hypothesis that one of the functions of APC is the regulation of caspase activity and other apoptotic proteins by controlling their expression levels in the cell.

Journal Article
TL;DR: This work describes a computational method for the identification of micro-exons using near-perfect alignments between cDNA and genomic DNA sequences, which detected 319 micro-exons, of which 224 were previously unknown, in genomes including human, nematode, and Arabidopsis thaliana.
Abstract: With the rapid increase in the generation of eukaryotic genome sequence data, the development of accurate, detailed computational methods for gene structure prediction and verification has become increasingly important. Many computational methods have addressed the difficult problem of exon-intron structure annotation (Florea et al. 1998; Usuka et al. 2000; Kan et al. 2001; Wheelan et al. 2001; Kent 2002). One area that remains particularly difficult is the identification of very short exons, both internal and terminal (Florea et al. 1998; Black 2000). These micro-exons often confound both alignment and ab initio gene prediction programs, because they contain virtually no meaningful statistical signal. Micro-exons with lengths up to 25 bp have been studied experimentally in plants, insects, and vertebrate animals (Reyes et al. 1991; McAlister et al. 1992; Sterner and Berget 1993; Chan and Black 1995; Simpson et al. 2000). These experimental studies support two important features of micro-exons: (1) they sometimes facilitate alternative splicing events, and (2) despite their small size, they are usually conserved between species (Berget 1995; Black 2000). The inclusion of micro-exons is dependent on stages of cell development and cell specificity. It is mediated by several factors, including intronic and exonic splicing enhancer and silencer elements, and SR and hnRNP proteins (Berget 1995; Black 2000; Simpson et al. 2000). Experimental and computational studies of micro-exons are recognized as important challenges in the understanding of splicing machinery and genome variability. This study describes a computational method for finding internal micro-exons in near-perfect alignments of cDNA and genomic DNA sequences. The method provides quick recognition of micro-exons that might have been missed by an alignment program (called the short exon error problem by Florea et al. [1998]). We demonstrate the method on four complete genomes, C. elegans (The C. elegans Sequencing Consortium 1998), D. melanogaster (Adams et al. 2000), A. thaliana (The Arabidopsis Genome Initiative 2000) and human (Venter et al. 2001; Lander et al. 2001). In each case, significant numbers of micro-exons were identified that were undetected in previous annotation efforts. We discovered previously unreported micro-exons, alternative splicing events, and highly conserved micro-exons across several species.

Journal Article
TL;DR: The sequence of chromosome II from Trypanosoma brucei, the causative agent of African sleeping sickness, is reported, suggesting that this region may be a site for modular de novo construction of VSG gene diversity during transposition/gene conversion events.
Abstract: We report here the sequence of chromosome II from Trypanosoma brucei, the causative agent of African sleeping sickness. The 1.2-Mb pairs encode about 470 predicted genes organised in 17 directional clusters on either strand, the largest cluster of which has 92 genes lined up over a 284-kb region. An analysis of the GC skew reveals strand compositional asymmetries that coincide with the distribution of protein-coding genes, suggesting these asymmetries may be the result of transcription-coupled repair on coding versus non-coding strand. A 5-cM genetic map of the chromosome reveals recombinational 'hot' and 'cold' regions, the latter of which is predicted to include the putative centromere. One end of the chromosome consists of a 250-kb region almost exclusively composed of RHS (pseudo)genes that belong to a newly characterised multigene family containing a hot spot of insertion for retroelements. Interspersed with the RHS genes are a few copies of truncated RNA polymerase pseudogenes as well as expression site associated (pseudo)genes (ESAGs) 3 and 4, and 76 bp repeats. These features are reminiscent of a vestigial variant surface glycoprotein (VSG) gene expression site. The other end of the chromosome contains a 30-kb array of VSG genes, the majority of which are pseudogenes, suggesting that this region may be a site for modular de novo construction of VSG gene diversity during transposition/gene conversion events.

Journal Article
TL;DR: Three programs for ab initio gene prediction in eukaryotes: Exonomy, Unveil and GlimmerM are presented, which are readily re-trainable for new organisms and have been found to perform well compared to other genefinders.
Abstract: We present three programs for ab initio gene prediction in eukaryotes: Exonomy, Unveil and GlimmerM. Exonomy is a 23-state Generalized Hidden Markov Model (GHMM), Unveil is a 283-state standard Hidden Markov Model (HMM) and GlimmerM is a previously-described genefinder which utilizes decision trees and Interpolated Markov Models (IMMs). All three are readily re-trainable for new organisms and have been found to perform well compared to other genefinders. Results are presented for Arabidopsis thaliana. Cases have been found where each of the genefinders outperforms each of the others, demonstrating the collective value of this ensemble of genefinders. These programs are all accessible through webservers at http://www.tigr.org/software.

Journal Article
TL;DR: This report provides the first global view of the series of historical events that have reshaped human pericentromeric regions over recent evolutionary time.
Abstract: Despite considerable advances in sequencing of the human genome over the past few years, the organization and evolution of human pericentromeric regions have been difficult to resolve. This is due, in part, to the presence of large, complex blocks of duplicated genomic sequence at the boundary between centromeric satellite and unique euchromatic DNA. Here, we report the identification and characterization of an approximately 49-kb repeat sequence that exists in more than 40 copies within the human genome. This repeat is specific to highly duplicated pericentromeric regions with multiple copies distributed in an interspersed fashion among a subset of human chromosomes. Using this interspersed repeat (termed PIR4) as a marker of pericentromeric DNA, we recovered and sequence-tagged 3 Mb of pericentromeric DNA from a variety of human chromosomes as well as nonhuman primate genomes. A global evolutionary reconstruction of the dispersal of PIR4 sequence and analysis of flanking sequence supports a model in which pericentromeric duplications initiated before the separation of the great ape species (>12 MYA). Further, analyses of this duplication and associated flanking duplications narrow the major burst of pericentromeric duplication activity to a time just before the divergence of the African great ape and human species (5 to 7 MYA). These recent duplication exchange events substantially restructured the pericentromeric regions of hominoid chromosomes and created an architecture where large blocks of sequence are shared among nonhomologous chromosomes. This report provides the first global view of the series of historical events that have reshaped human pericentromeric regions over recent evolutionary time.

Journal Article
TL;DR: A novel approach to constructing whole-genome microarrays based on PCR amplification of the 3' ends of each predicted gene from genomic DNA was developed, and an array representing more than 94% of the predicted genes and pseudogenes on chromosome 2 was constructed.
Abstract: The gene predictions and accompanying functional assignments resulting from the sequencing and annotation of a genome represent hypotheses that can be tested and used to develop a more complete understanding of the organism and its biology. In the model plant Arabidopsis thaliana, we developed a novel approach to constructing whole-genome microarrays based on PCR amplification of the 3' ends of each predicted gene from genomic DNA, and constructed an array representing more than 94% of the predicted genes and pseudogenes on chromosome 2. With this array, we examined various tissues and physiological conditions, providing expression-based validation for 84% of the gene predictions and providing clues as to the functions of many predicted genes. Further, by examining the distribution of expression along the physical chromosome, we were able to identify a region of repressed transcription that may represent a previously undescribed heterochromatic region.

Journal Article
TL;DR: The utility of EST analysis is demonstrated by identifying new protease genes, which may be involved in hemoglobin degradation, in the Schistosoma mansoni transcriptome.
Abstract: Expressed sequence tag (EST) sequencing and analysis is a primary research tool to identify and characterize the Schistosoma mansoni transcriptome. As part of our gene discovery effort, a total of 5,793 ESTs have been generated from clones selected randomly from complementary DNA (cDNA) libraries constructed from male and female adult worms. Assembly analysis of all the 16,813 public S. mansoni ESTs has identified 1,920 distinct tentative consensus sequences (TCs) and 5,571 nonoverlapping ESTs (singletons). Of these, 376 TCs (20%) and 1,449 singletons (26%) are unique to the SUNY/TIGR sequencing effort. Tentative consensus sequences and singletons were distributed into various categories of biological roles associated with cell structure, metabolism, protein fate, signal transduction, transcription, protein synthesis, transporters, and cell growth. The TCs and singletons represent transcripts that can be used as a resource for functional annotation of genomic sequence data, comparative sequence analysis, and cDNA clone selection for microarray projects. The utility of EST analysis is demonstrated by identifying new protease genes, which may be involved in hemoglobin degradation.

Journal Article
TL;DR: A method of improving the quality of automatically extracted noun phrases by employing prior knowledge during the HMM training procedure for the tagger can greatly improve the quality and relevance of the extracted phrases, thereby enabling greater accuracy in downstream literature mining tasks.
Abstract: Motivation: The recent explosion of interest in mining the biomedical literature for associations between defined entities such as genes, diseases and drugs has made apparent the need for robust methods of identifying occurrences of these entities in biomedical text. Such concept-based indexing is strongly dependent on the availability of a comprehensive ontology or lexicon of biomedical terms. However, such ontologies are very difficult and expensive to construct, and often require extensive manual curation to render them suitable for use by automatic indexing programs. Furthermore, the use of statistically salient noun phrases as surrogates for curated terminology is not without difficulties, due to the lack of high-quality part-of-speech taggers specific to medical nomenclature. Results: We describe a method of improving the quality of automatically extracted noun phrases by employing prior knowledge during the HMM training procedure for the tagger. This enhancement, when combined with appropriate training data, can greatly improve the quality and relevance of the extracted phrases, thereby enabling greater accuracy in downstream literature mining tasks.