Showing papers in "Genome Research in 2003"


Journal ArticleDOI
TL;DR: Several case studies of Cytoscape plug-ins are surveyed, including a search for interaction pathways correlating with changes in gene expression, a study of protein complexes involved in cellular recovery to DNA damage, inference of a combined physical/functional interaction network for Halobacterium, and an interface to detailed stochastic/kinetic gene regulatory models.
Abstract: Cytoscape is an open source software project for integrating biomolecular interaction networks with high-throughput expression data and other molecular states into a unified conceptual framework. Although applicable to any system of molecular components and interactions, Cytoscape is most powerful when used in conjunction with large databases of protein-protein, protein-DNA, and genetic interactions that are increasingly available for humans and model organisms. Cytoscape's software Core provides basic functionality to layout and query the network; to visually integrate the network with expression profiles, phenotypes, and other molecular states; and to link the network to databases of functional annotations. The Core is extensible through a straightforward plug-in architecture, allowing rapid development of additional computational analyses and features. Several case studies of Cytoscape plug-ins are surveyed, including a search for interaction pathways correlating with changes in gene expression, a study of protein complexes involved in cellular recovery to DNA damage, inference of a combined physical/functional interaction network for Halobacterium, and an interface to detailed stochastic/kinetic gene regulatory models.

32,980 citations


Journal ArticleDOI
TL;DR: OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs.
Abstract: The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, but improved recognition of "recent" paralogs permits overlapping EGO groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P. falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite genome.

5,321 citations


Journal ArticleDOI
TL;DR: The PANTHER/X ontology is used to give a high-level representation of gene function across the human and mouse genomes, and the family HMMs are used to rank missense single nucleotide polymorphisms (SNPs) according to their likelihood of affecting protein function.
Abstract: In the genomic era, one of the fundamental goals is to characterize the function of proteins on a large scale. We describe a method, PANTHER, for relating protein sequence relationships to function relationships in a robust and accurate way. PANTHER is composed of two main components: the PANTHER library (PANTHER/LIB) and the PANTHER index (PANTHER/X). PANTHER/LIB is a collection of "books," each representing a protein family as a multiple sequence alignment, a Hidden Markov Model (HMM), and a family tree. Functional divergence within the family is represented by dividing the tree into subtrees based on shared function, and by subtree HMMs. PANTHER/X is an abbreviated ontology for summarizing and navigating molecular functions and biological processes associated with the families and subfamilies. We apply PANTHER to three areas of active research. First, we report the size and sequence diversity of the families and subfamilies, characterizing the relationship between sequence divergence and functional divergence across a wide range of protein families. Second, we use the PANTHER/X ontology to give a high-level representation of gene function across the human and mouse genomes. Third, we use the family HMMs to rank missense single nucleotide polymorphisms (SNPs), on a database-wide scale, according to their likelihood of affecting protein function.

2,857 citations


Journal ArticleDOI
TL;DR: This work describes BLASTZ, an independent implementation of the Gapped BLAST algorithm specifically designed for aligning two long genomic sequences, and its modifications, the hardware environment on which it is run, and several empirical studies to validate its results.
Abstract: The Mouse Genome Analysis Consortium aligned the human and mouse genome sequences for a variety of purposes, using alignment programs that suited the various needs. For investigating issues regarding genome evolution, a particularly sensitive method was needed to permit alignment of a large proportion of the neutrally evolving regions. We selected a program called BLASTZ, an independent implementation of the Gapped BLAST algorithm specifically designed for aligning two long genomic sequences. BLASTZ was subsequently modified, both to attain efficiency adequate for aligning entire mammalian genomes and to increase its sensitivity. This work describes BLASTZ, its modifications, the hardware environment on which we run it, and several empirical studies to validate its results.

1,281 citations


Journal ArticleDOI
TL;DR: The metabolic network in the yeast Saccharomyces cerevisiae was reconstructed using currently available genomic, biochemical, and physiological information, and the reconstruction may be used as the basis for in silico analysis of phenotypic functions.
Abstract: The metabolic network in the yeast Saccharomyces cerevisiae was reconstructed using currently available genomic, biochemical, and physiological information. The metabolic reactions were compartmentalized between the cytosol and the mitochondria, and transport steps between the compartments and the environment were included. A total of 708 structural open reading frames (ORFs) were accounted for in the reconstructed network, corresponding to 1035 metabolic reactions. Further, 140 reactions were included on the basis of biochemical evidence resulting in a genome-scale reconstructed metabolic network containing 1175 metabolic reactions and 584 metabolites. The number of gene functions included in the reconstructed network corresponds to approximately 16% of all characterized ORFs in S. cerevisiae. Using the reconstructed network, the metabolic capabilities of S. cerevisiae were calculated and compared with Escherichia coli. The reconstructed metabolic network is the first comprehensive network for a eukaryotic organism, and it may be used as the basis for in silico analysis of phenotypic functions.

1,127 citations


Journal ArticleDOI
TL;DR: Both LAGAN and Multi-LAGAN compare favorably with other leading alignment methods in correctly aligning protein-coding exons, especially between distant homologs such as human and chicken, or human and fugu.
Abstract: Comparing genomic sequences across related species is a fruitful source of biological insight, because functional elements such as exons tend to exhibit significant sequence similarity, whereas regions that are not functional tend to be less conserved. The first step in comparing genomic sequences is to align them, that is, to map the letters of one sequence to those of the others. There are several categories of alignments: local alignments that identify local similarities between regions of each sequence, and global alignments that find a monotonically increasing map between the letters of each sequence; pairwise alignments that compare two sequences, and multiple alignments that compare several sequences. Local pairwise alignment methods such as Smith-Waterman (1981), BLAST (Altschul et al. 1990, 1997), BLASTZ (Schwartz et al. 2000), SSAHA (Ning et al. 2001), and BLAT (Kent 2002) are able to pinpoint locations of rearrangements between two sequences, and are suitable for aligning draft sequences or individual reads. Global alignments are important because they reveal the shared order of biological features in the compared species, and produce a more accurate alignment at the base-pair level when the features are in the same order. The best-known global alignment algorithm is Needleman-Wunsch (1970), which requires time proportional to the product of the lengths of the aligned sequences. Unfortunately this algorithm is too inefficient for comparing long genomic sequences. Faster methods have been developed recently: DIALIGN (Morgenstern et al. 1998, Brudno and Morgenstern 2002), MUMmer (Delcher et al. 1999, 2002), GLASS (Batzoglou et al. 2000), WABA (Kent and Zahler 2000), and AVID (Bray et al. 2003). Most of these methods have proven effective in aligning genomic sequences from two closely related organisms, such as human and mouse or Caenorhabditis elegans and C. briggsae, but have not been tested in alignments between distant relatives such as human and fugu.
Multiple alignments, a natural extension of two-sequence comparisons, are a powerful way to study biological sequences. Even weak similarity across several sequences usually reveals an important conserved biological feature (Dubchak et al. 2000; Gottgens et al. 2002). Moreover, multiple alignments enable the computation of local rates of evolution, giving a quantitative measure of the strength of evolutionary constraints and the functional importance of local regions (Simon et al. 2002). Multiple alignments are considerably more difficult to compute than are pairwise alignments: the running time scales as the product of the lengths of all the sequences. Formally, the problem is NP-complete (Wang and Jiang 1994; Bonizzoni and Vedova 2001). For this reason heuristic approaches are usually applied, of which the most widely used is progressive alignment, which constructs a multiple alignment by successive applications of a pairwise alignment algorithm. The best-known system based on progressive alignment is perhaps CLUSTALW (Thompson et al. 1994). Some other systems include MULTALIGN (Barton and Sternberg 1987), MULTAL (Taylor 1988), YAMA (Hardison et al. 1993, 1994), and PRRP (Gotoh 1996). DIALIGN (Morgenstern 1999) does not use progressive alignment; instead it uses another heuristic approach to chain local conserved blocks between several sequences into a multiple alignment. These systems can effectively align proteins and relatively short genomic regions, but are not efficient enough to align entire genomes. MGA (Hohl et al. 2002) is a rapid multiple aligner suitable for comparing very close homologs, such as different strains of a bacterium, but is not designed to align distant homologs.
Here we describe novel systems for pairwise and multiple alignment of genomic sequences: LAGAN (Limited Area Global Alignment of Nucleotides), an efficient and reliable pairwise aligner that is suitable for genomic comparison of distantly related organisms, and MLAGAN (Multi-LAGAN), a multiple aligner based on progressive alignment with LAGAN. We tested our systems on sequence from 12 species generated for the genomic segment harboring the cystic fibrosis transmembrane conductance regulator (CFTR) gene (J.W. Thomas, J.W. Touchman, R.W. Blakesley, G.G. Bouffard, S.M. Beckstrom-Sternberg, E.H. Margulies, M. Blanchette, A.C. Siepel, P.J. Thomas, J.C. McDowell, B. Maskeri, N.F. Hansen, M.S. Schwartz, R.J. Weber, W.J. Kent, D. Karolchik, T.C. Bruen, R. Bevan, D.J. Cutler, S. Schwartz, L. Elnitski, J.R. Idol, A.B. Prasad, S.-Q. Lee-Lin, V.V.B. Maduro, M.E. Portnoy, N.L. Dietrich, N. Akhter, K. Ayele, B. Benjamin, K. Cariaga, C.P. Brinkley, S.Y. Brooks, S. Granite, X. Guan, J. Gupta, P. Haghighi, S-L. Ho, M.C. Huang, E. Karlins, P.L. Laric, R. Legaspi, M.J. Lim, Q.L. Maduro, C.A. Masiello, S.D. Mastrian, J.C. McCloskey, R. Pearson, S. Stantripop, E.E. Tiongson, J.T. Tran, C. Tsurgeon, J.L. Vogt, M.A. Walker, K.D. Wetherby, L.S. Wiggins, A.C. Young, L-H. Zhang, K. Osoegawa, B. Zhu, B. Zhao, C.L. Shu, P.J. De Jong, C.E. Lawrence, A.F. Smit, A. Chakravarti, D. Haussler, P. Green, W. Miller, and E.D. Green, in prep.). Based on comparisons with other available alignment programs and benchmarking on standard desktop computer systems, we conclude that LAGAN and MLAGAN are practical and reliable methods for large-scale pairwise and multiple genomic alignment that should prove useful for obtaining alignments of the entire human, mouse, fugu, rat, and other genomes in the context of a whole-genome alignment pipeline.
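The abstract above notes that Needleman-Wunsch global alignment takes time proportional to the product of the two sequence lengths. The following is a minimal sketch of that classical dynamic program, not of LAGAN itself; the function name and the scoring values (match, mismatch, and linear gap penalty) are illustrative assumptions rather than anything taken from the paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment.

    Illustrative scoring only: the match/mismatch/gap values are assumptions.
    The nested loops fill (len(a)+1) * (len(b)+1) cells, which is the
    quadratic cost mentioned in the abstract.
    """
    m, n = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two letters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For two sequences of about a megabase each, the table alone would hold on the order of 10^12 cells, which is why the abstract calls the full dynamic program too inefficient for long genomic sequences.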

1,106 citations


Journal ArticleDOI
TL;DR: The Human Protein Reference Database (HPRD) is an object database that integrates a wealth of information relevant to the function of human proteins in health and disease, including protein-protein interactions, posttranslational modifications, enzyme/substrate relationships, disease associations, tissue expression, and subcellular localization.
Abstract: Human Protein Reference Database (HPRD) is an object database that integrates a wealth of information relevant to the function of human proteins in health and disease. Data pertaining to thousands of protein-protein interactions, posttranslational modifications, enzyme/substrate relationships, disease associations, tissue expression, and subcellular localization were extracted from the literature for a nonredundant set of 2750 human proteins. Almost all the information was obtained manually by biologists who read and interpreted >300,000 published articles during the annotation process. This database, which has an intuitive query interface allowing easy access to all the features of proteins, was built by using open source technologies and will be freely available at http://www.hprd.org to the academic community. This unified bioinformatics platform will be useful in cataloging and mining the large number of proteomic interactions and alterations that will be discovered in the postgenomic era.

1,088 citations


Journal ArticleDOI
TL;DR: This method is fast, efficient, and reliable and makes it possible to generate cko-targeting vectors in less than 2 wk and should also facilitate the generation of knock-in mutations and transgene constructs, as well as expedite the analysis of regulatory elements and functional domains in or near genes.
Abstract: Phage-based Escherichia coli homologous recombination systems have recently been developed that now make it possible to subclone or modify DNA cloned into plasmids, BACs, or PACs without the need for restriction enzymes or DNA ligases. This new form of chromosome engineering, termed recombineering, has many different uses for functional genomic studies. Here we describe a new recombineering-based method for generating conditional mouse knockout (cko) mutations. This method uses homologous recombination mediated by the lambda phage Red proteins, to subclone DNA from BACs into high-copy plasmids by gap repair, and together with Cre or Flpe recombinases, to introduce loxP or FRT sites into the subcloned DNA. Unlike other methods that use short 45-55-bp regions of homology for recombineering, our method uses much longer regions of homology. We also make use of several new E. coli strains, in which the proteins required for recombination are expressed from a defective temperature-sensitive lambda prophage, and the Cre or Flpe recombinases from an arabinose-inducible promoter. We also describe two new Neo selection cassettes that work well in both E. coli and mouse ES cells. Our method is fast, efficient, and reliable and makes it possible to generate cko-targeting vectors in less than 2 wk. This method should also facilitate the generation of knock-in mutations and transgene constructs, as well as expedite the analysis of regulatory elements and functional domains in or near genes.

1,084 citations


Journal ArticleDOI
TL;DR: Although these stochastic methods cannot guarantee global optimality with certainty, their robustness, plus the fact that in inverse problems they have a known lower bound for the cost function, make them the best available candidates.
Abstract: Here we address the problem of parameter estimation (inverse problem) of nonlinear dynamic biochemical pathways. This problem is stated as a nonlinear programming (NLP) problem subject to nonlinear differential-algebraic constraints. These problems are known to be frequently ill-conditioned and multimodal. Thus, traditional (gradient-based) local optimization methods fail to arrive at satisfactory solutions. To surmount this limitation, the use of several state-of-the-art deterministic and stochastic global optimization methods is explored. A case study considering the estimation of 36 parameters of a nonlinear biochemical dynamic model is taken as a benchmark. Only a certain type of stochastic algorithm, evolution strategies (ES), is able to solve this problem successfully. Although these stochastic methods cannot guarantee global optimality with certainty, their robustness, plus the fact that in inverse problems they have a known lower bound for the cost function, make them the best available candidates.
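To make the idea of a stochastic search over kinetic parameters concrete, here is a toy sketch of a (1+1) evolution strategy fitting a two-parameter exponential decay model by least squares. It is not the algorithm or the 36-parameter benchmark from the paper: the model, the function names, the step-size factors, and the synthetic data are all assumptions chosen for brevity. The sum-of-squares cost has the known lower bound of zero that the abstract refers to.

```python
import math
import random


def cost(params, times, observed):
    """Sum of squared residuals for the toy model y = a * exp(-k * t)."""
    a, k = params
    try:
        return sum((a * math.exp(-k * t) - y) ** 2
                   for t, y in zip(times, observed))
    except OverflowError:
        # Treat numerically exploding candidates as infinitely bad.
        return float("inf")


def one_plus_one_es(cost_fn, start, sigma=0.5, iterations=2000, seed=0):
    """Minimal (1+1)-ES: one parent, one Gaussian-mutated child per step.

    The child replaces the parent only if it is at least as good, so the
    best cost never increases. The step size grows after a success and
    shrinks after a failure (a crude success-based rule, assumed here).
    """
    rng = random.Random(seed)
    best = list(start)
    best_cost = cost_fn(best)
    for _ in range(iterations):
        candidate = [x + rng.gauss(0.0, sigma) for x in best]
        candidate_cost = cost_fn(candidate)
        if candidate_cost <= best_cost:
            best, best_cost = candidate, candidate_cost
            sigma *= 1.5
        else:
            sigma *= 0.9
    return best, best_cost


# Synthetic, noise-free data generated from a = 2.0, k = 0.7 (hypothetical).
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
observed = [2.0 * math.exp(-0.7 * t) for t in times]
start = [1.0, 0.1]
fitted, fitted_cost = one_plus_one_es(
    lambda p: cost(p, times, observed), start)
```

Because acceptance is elitist, the returned cost is never worse than the starting cost; how close it gets to zero depends on the seed, the step-size rule, and the iteration budget, and no global-optimality guarantee is implied.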

908 citations


Journal ArticleDOI
TL;DR: The phylogeny and synteny data suggest that the common ancestor of zebrafish and pufferfish, a fish that gave rise to approximately 22000 species, experienced a large-scale gene or complete genome duplication event and that the pufferfish has lost many duplicates that the zebrafish has retained.
Abstract: Through phylogeny reconstruction we identified 49 genes with a single copy in man, mouse, and chicken, one or two copies in the tetraploid frog Xenopus laevis, and two copies in zebrafish (Danio rerio). For 22 of these genes, both zebrafish duplicates had orthologs in the pufferfish (Takifugu rubripes). For another 20 of these genes, we found only one pufferfish ortholog but in each case it was more closely related to one of the zebrafish duplicates than to the other. Forty-three pairs of duplicated genes map to 24 of the 25 zebrafish linkage groups but they are not randomly distributed; we identified 10 duplicated regions of the zebrafish genome that each contain between two and five sets of paralogous genes. These phylogeny and synteny data suggest that the common ancestor of zebrafish and pufferfish, a fish that gave rise to approximately 22000 species, experienced a large-scale gene or complete genome duplication event and that the pufferfish has lost many duplicates that the zebrafish has retained.

859 citations


Journal ArticleDOI
TL;DR: This work develops a method that simultaneously clusters genes and conditions, finding distinctive "checkerboard" patterns in matrices of gene expression data, if they exist, and applies it to a selection of publicly available cancer expression data sets.
Abstract: Global analyses of RNA expression levels are useful for classifying genes and overall phenotypes. Often these classification problems are linked, and one wants to find “marker genes” that are differentially expressed in particular sets of “conditions.” We have developed a method that simultaneously clusters genes and conditions, finding distinctive “checkerboard” patterns in matrices of gene expression data, if they exist. In a cancer context, these checkerboards correspond to genes that are markedly up- or downregulated in patients with particular types of tumors. Our method, spectral biclustering, is based on the observation that checkerboard structures in matrices of expression data can be found in eigenvectors corresponding to characteristic expression patterns across genes or conditions. In addition, these eigenvectors can be readily identified by commonly used linear algebra approaches, in particular the singular value decomposition (SVD), coupled with closely integrated normalization steps. We present a number of variants of the approach, depending on whether the normalization over genes and conditions is done independently or in a coupled fashion. We then apply spectral biclustering to a selection of publicly available cancer expression data sets, and examine the degree to which the approach is able to identify checkerboard structures. Furthermore, we compare the performance of our biclustering methods against a number of reasonable benchmarks (e.g., direct application of SVD or normalized cuts to raw data).

Journal ArticleDOI
TL;DR: It is concluded that the Arabidopsis lineage underwent at least two distinct episodes of duplication, one of which was a polyploidy that occurred much more recently than estimated previously and probably during the early emergence of the crucifer family.
Abstract: The Arabidopsis genome contains numerous large duplicated chromosomal segments, but the different approaches used in previous analyses led to different interpretations regarding the number and timing of ancestral large-scale duplication events. Here, using more appropriate methodology and a more recent version of the genome sequence annotation, we investigate the scale and timing of segmental duplications in Arabidopsis. We used protein sequence similarity searches to detect duplicated blocks in the genome, used the level of synonymous substitution between duplicated genes to estimate the relative ages of the blocks containing them, and analyzed the degree of overlap between adjacent duplicated blocks. We conclude that the Arabidopsis lineage underwent at least two distinct episodes of duplication. One was a polyploidy that occurred much more recently than estimated previously, before the Arabidopsis/Brassica rapa split and probably during the early emergence of the crucifer family (24-40 Mya). An older set of duplicated blocks was formed after the monocot/dicot divergence, and the relatively low level of overlap among these blocks indicates that at least some of them are remnants of a larger duplication such as a polyploidy or aneuploidy.

Journal ArticleDOI
TL;DR: The goal is to rapidly deliver allelic series of ethylmethanesulfonate-induced mutations in target 1-kb loci requested by the international research community.
Abstract: TILLING (Targeting Induced Local Lesions in Genomes) is a general reverse-genetic strategy that provides an allelic series of induced point mutations in genes of interest. High-throughput TILLING allows the rapid and low-cost discovery of induced point mutations in populations of chemically mutagenized individuals. As chemical mutagenesis is widely applicable and mutation detection for TILLING is dependent only on sufficient yield of PCR products, TILLING can be applied to most organisms. We have developed TILLING as a service to the Arabidopsis community known as the Arabidopsis TILLING Project (ATP). Our goal is to rapidly deliver allelic series of ethylmethanesulfonate-induced mutations in target 1-kb loci requested by the international research community. In the first year of public operation, ATP has discovered, sequenced, and delivered >1000 mutations in >100 genes ordered by Arabidopsis researchers. The tools and methodologies described here can be adapted to create similar facilities for other organisms.

Journal ArticleDOI
TL;DR: This work identified a class of divergently transcribed gene pairs, representing more than 10% of the genes in the genome, whose transcription start sites are separated by less than 1000 base pairs, and demonstrated that a bidirectional arrangement provides a unique mechanism of regulation for a significant number of mammalian genes.
Abstract: The alignment of full-length human cDNA sequences to the finished sequence of the human genome provides a unique opportunity to study the distribution of genes throughout the genome. By analyzing the distances between 23,752 genes, we identified a class of divergently transcribed gene pairs, representing more than 10% of the genes in the genome, whose transcription start sites are separated by less than 1000 base pairs. Although this bidirectional arrangement has been previously described in humans and other species, the prevalence of bidirectional gene pairs in the human genome is striking, and the mechanisms of regulation of all but a few bidirectional genes are unknown. Our work shows that the transcripts of many bidirectional pairs are coexpressed, but some are antiregulated. Further, we show that many of the promoter segments between two bidirectional genes initiate transcription in both directions and contain shared elements that regulate both genes. We also show that the bidirectional arrangement is often conserved among mouse orthologs. These findings demonstrate that a bidirectional arrangement provides a unique mechanism of regulation for a significant number of mammalian genes.
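The selection criterion described above (divergently transcribed genes whose transcription start sites lie less than 1000 base pairs apart) can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Gene` record, the coordinate convention (a minus-strand gene whose start site has the lower coordinate is transcribed away from a plus-strand neighbor), and the example genes are hypothetical and do not come from the paper.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    strand: str  # '+' or '-'
    tss: int     # transcription start site coordinate (hypothetical units: bp)


def divergent_pairs(genes: List[Gene], max_gap: int = 1000) -> List[Tuple[str, str]]:
    """Return (minus_gene, plus_gene) name pairs that are divergently
    transcribed with start sites less than max_gap base pairs apart."""
    by_chrom = {}
    for g in genes:
        by_chrom.setdefault(g.chrom, []).append(g)

    pairs = []
    for chrom_genes in by_chrom.values():
        minus = sorted((g for g in chrom_genes if g.strand == "-"),
                       key=lambda g: g.tss)
        plus = sorted((g for g in chrom_genes if g.strand == "+"),
                      key=lambda g: g.tss)
        # Quadratic scan kept for clarity; fine for a small illustration.
        for m in minus:
            for p in plus:
                # Divergent: the minus-strand start site is to the left of
                # the plus-strand start site, so the genes point apart.
                if 0 < p.tss - m.tss < max_gap:
                    pairs.append((m.name, p.name))
    return pairs


example = [
    Gene("A", "chr1", "-", 1000),
    Gene("B", "chr1", "+", 1500),   # divergent with A, 500 bp apart
    Gene("C", "chr1", "+", 5000),
    Gene("D", "chr1", "-", 5200),   # points toward C, not divergent
    Gene("E", "chr2", "-", 100),
    Gene("F", "chr2", "+", 1200),   # divergent with E but 1100 bp apart
]
result = divergent_pairs(example)
```

In this toy input only the pair A/B satisfies both the orientation and the distance condition.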

Journal ArticleDOI
TL;DR: This work measures mRNA decay rates in two human cell lines with high-density oligonucleotide arrays and investigates the dependence of decay rates on sequence composition, that is, the presence or absence of short mRNA motifs in various regions of the mRNA transcript.
Abstract: Although mRNA decay rates are a key determinant of the steady-state concentration for any given mRNA species, relatively little is known, on a population level, about what factors influence turnover rates and how these rates are integrated into cellular decisions. We decided to measure mRNA decay rates in two human cell lines with high-density oligonucleotide arrays that enable the measurement of decay rates simultaneously for thousands of mRNA species. Using existing annotation and the Gene Ontology hierarchy of biological processes, we assign mRNAs to functional classes at various levels of resolution and compare the decay rate statistics between these classes. The results show statistically significant organizational principles in the variation of decay rates among functional classes. In particular, transcription factor mRNAs have increased average decay rates compared with other transcripts and are enriched in "fast-decaying" mRNAs with half-lives <2 h. In contrast, we find that mRNAs for biosynthetic proteins have decreased average decay rates and are deficient in fast-decaying mRNAs. Our analysis of data from a previously published study of Saccharomyces cerevisiae mRNA decay shows the same functional organization of decay rates, implying that it is a general organizational scheme for eukaryotes. Additionally, we investigated the dependence of decay rates on sequence composition, that is, the presence or absence of short mRNA motifs in various regions of the mRNA transcript. Our analysis recovers the positive correlation of mRNA decay with known AU-rich mRNA motifs, but we also uncover further short mRNA motifs that show statistically significant correlation with decay. However, we also note that none of these motifs are strong predictors of mRNA decay rate, indicating that the regulation of mRNA decay is more complex and may involve the cooperative binding of several RNA-binding proteins at different sites.
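The abstract classifies mRNAs as "fast-decaying" when their half-life is under 2 h. As a small worked example, and assuming simple first-order (exponential) decay, which the abstract does not state explicitly, a decay rate constant k relates to the half-life by t½ = ln 2 / k. The function names and the two-timepoint estimate below are illustrative assumptions, not the paper's fitting procedure.

```python
import math


def half_life(decay_rate):
    """Half-life for first-order decay with rate constant decay_rate (per hour)."""
    if decay_rate <= 0:
        raise ValueError("decay rate must be positive")
    return math.log(2) / decay_rate


def decay_rate_from_timepoints(t1, y1, t2, y2):
    """Estimate the rate constant from two abundance measurements,
    assuming y(t) = y0 * exp(-k * t). Times are in hours."""
    return math.log(y1 / y2) / (t2 - t1)


def is_fast_decaying(decay_rate, threshold_hours=2.0):
    """Apply the half-life < 2 h cutoff quoted in the abstract."""
    return half_life(decay_rate) < threshold_hours


# Hypothetical transcript whose abundance halves between 0 h and 1 h.
k = decay_rate_from_timepoints(0.0, 100.0, 1.0, 50.0)
```

A transcript whose signal halves in one hour has k = ln 2 per hour and a half-life of 1 h, so it would fall in the fast-decaying class under this cutoff.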

Journal ArticleDOI
TL;DR: A new global alignment method called AVID is described, designed to be fast, memory efficient, and practical for sequence alignments of large genomic regions up to megabases long, and a format is established for the representation of alignments and methods for their comparison.
Abstract: In this paper we describe a new global alignment method called AVID. The method is designed to be fast, memory efficient, and practical for sequence alignments of large genomic regions up to megabases long. We present numerous applications of the method, ranging from the comparison of assemblies to alignment of large syntenic genomic regions and whole genome human/mouse alignments. We have also performed a quantitative comparison of AVID with other popular alignment tools. To this end, we have established a format for the representation of alignments and methods for their comparison. These formats and methods should be useful for future studies. The tools we have developed for the alignment comparisons, as well as the AVID program, are publicly available. See Web Site References section for AVID Web address and Web addresses for other programs discussed in this paper.

Journal ArticleDOI
TL;DR: ROMA (representational oligonucleotide microarray analysis) will assist in the discovery of genes and markers important in cancer, and the discovery of loci that may be important in inherited predispositions to disease.
Abstract: We have developed a methodology we call ROMA (representational oligonucleotide microarray analysis), for the detection of the genomic aberrations in cancer and normal humans. By arraying oligonucleotide probes designed from the human genome sequence, and hybridizing with “representations” from cancer and normal cells, we detect regions of the genome with altered “copy number.” We achieve an average resolution of 30 kb throughout the genome, and resolutions as high as a probe every 15 kb are practical. We illustrate the characteristics of probes on the array and accuracy of measurements obtained using ROMA. Using this methodology, we identify variation between cancer and normal genomes, as well as between normal human genomes. In cancer genomes, we readily detect amplifications and large and small homozygous and hemizygous deletions. Between normal human genomes, we frequently detect large (100 kb to 1 Mb) deletions or duplications. Many of these changes encompass known genes. ROMA will assist in the discovery of genes and markers important in cancer, and the discovery of loci that may be important in inherited predispositions to disease.

Journal ArticleDOI
TL;DR: It is demonstrated that variation of gene expression between alleles is common and may contribute to human variability; the allelic variation was confirmed by real-time quantitative PCR experiments.
Abstract: Variations in gene sequence and expression underlie much of human variability. Despite the known biological roles of differential allelic gene expression resulting from X-chromosome inactivation and genomic imprinting, a large-scale analysis of allelic gene expression in human is lacking. We examined allele-specific gene expression of 1063 transcribed single-nucleotide polymorphisms (SNPs) by using Affymetrix HuSNP oligo arrays. Among the 602 genes that were heterozygous and expressed in kidney or liver tissues from seven individuals, 326 (54%) showed preferential expression of one allele in at least one individual, and 170 of those showed greater than fourfold difference between the two alleles. The allelic variation has been confirmed by real-time quantitative PCR experiments. Some of these 170 genes are known to be imprinted, such as SNRPN, IPW, HTR2A, and PEG3. Most of the differentially expressed genes are not in known imprinting domains but instead are distributed throughout the genome. Our studies demonstrate that variation of gene expression between alleles is common, and this variation may contribute to human variability.

Journal ArticleDOI
TL;DR: Compared with older, PCR-based methods, Multiple Displacement Amplification of genomic DNA directly from cells shows low amplification bias, is highly reproducible, eliminates the need for DNA template purification, and allows genetic testing from small clinical samples.
Abstract: Preparation of genomic DNA from clinical samples is a bottleneck in genotyping and DNA sequencing analysis and is frequently limited by the amount of specimen available. We use Multiple Displacement Amplification (MDA) to amplify the whole genome 10,000-fold directly from small amounts of whole blood, dried blood, buccal cells, cultured cells, and buffy coat specimens, generating large amounts of DNA for genetic testing. Genomic DNA was evenly amplified with complete coverage and consistent representation of all genes. All 47 loci analyzed from 44 individuals were represented in the amplified DNA at between 0.5- and 3.0-fold of the copy number in the starting genomic DNA template. A high-fidelity DNA polymerase ensures accurate representation of the DNA sequence. The amplified DNA was indistinguishable from the original genomic DNA template in 5 SNP and 10 microsatellite DNA assays on three different clinical sample types for 20 individuals. Amplification of genomic DNA directly from cells is highly reproducible, eliminates the need for DNA template purification, and allows genetic testing from small clinical samples. The low amplification bias of MDA represents a dramatic technical improvement in the ability to amplify a whole genome compared with older, PCR-based methods.

Journal ArticleDOI
TL;DR: TILLING can be used to detect the full spectrum of ENU-induced mutations in a vertebrate genome with the presence of many naturally occurring polymorphisms and is shown to be a highly efficient and easy method to do target-selected mutagenesis in zebrafish.
Abstract: One of the most powerful methods available to assign function to a gene is to inactivate or knock out the gene. Recently, we described the first target-selected knockout in zebrafish. Here, we report on the further improvements of this procedure, resulting in a highly efficient and easy method to do target-selected mutagenesis in zebrafish. A library of 4608 ENU-mutagenized F1 animals was generated and kept as a living stock. The DNA of these animals was screened for mutations in 16 genes by use of CEL-I-mediated heteroduplex cleavage (TILLING) and subsequent resequencing. In total, 255 mutations were identified, of which 14 resulted in a premature stop codon, 7 in a splice donor/acceptor site mutation, and 119 in an amino acid change. By this method, we potentially knocked out 13 different genes in a few months' time. Furthermore, we show that TILLING can be used to detect the full spectrum of ENU-induced mutations in a vertebrate genome with the presence of many naturally occurring polymorphisms.

Journal ArticleDOI
TL;DR: The method complements bifurcation studies of the system's parameter dependence by providing estimates of sizes, correlations, and time scales of stochastic fluctuations; with suitable variable changes and elimination of fast variables, the linear noise approximation also gives precise descriptions near macroscopic instabilities.
Abstract: Biochemical networks in single cells can display large fluctuations in molecule numbers, making mesoscopic approaches necessary for correct system descriptions. We present a general method that allows rapid characterization of the stochastic properties of intracellular networks. The starting point is a macroscopic description that identifies the system's elementary reactions in terms of rate laws and stoichiometries. From this formulation follows directly the stationary solution of the linear noise approximation (LNA) of the Master equation for all the components in the network. The method complements bifurcation studies of the system's parameter dependence by providing estimates of sizes, correlations, and time scales of stochastic fluctuations. We describe how the LNA can give precise system descriptions also near macroscopic instabilities by suitable variable changes and elimination of fast variables.
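
The abstract does not spell out its equations, so the following is only a minimal sketch of how a stationary LNA covariance is commonly computed, assuming the standard Lyapunov form J C + C Jᵀ + D = 0 with D = S diag(rates) Sᵀ. The function name, the birth-death example, and the parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def lna_stationary_covariance(jacobian, stoich, rates):
    """Solve J C + C J^T + D = 0 for the stationary covariance C.

    jacobian: Jacobian of the macroscopic rate equations at steady state
    stoich:   stoichiometric matrix S (species x reactions)
    rates:    reaction rates evaluated at the steady state
    Assumes the standard LNA Lyapunov form; D = S diag(rates) S^T.
    """
    J = np.atleast_2d(np.asarray(jacobian, dtype=float))
    S = np.atleast_2d(np.asarray(stoich, dtype=float))
    D = S @ np.diag(rates) @ S.T
    n = J.shape[0]
    I = np.eye(n)
    # With column-major vec: vec(J C) = (I kron J) vec(C),
    # and vec(C J^T) = (J kron I) vec(C).
    A = np.kron(I, J) + np.kron(J, I)
    c = np.linalg.solve(A, -D.flatten(order="F"))
    return c.reshape((n, n), order="F")

# Hypothetical birth-death process: production at rate k, decay at gamma * x.
k, gamma = 10.0, 0.5
x_ss = k / gamma                      # macroscopic steady state (20 molecules)
C = lna_stationary_covariance(
    jacobian=[[-gamma]],
    stoich=[[1, -1]],
    rates=[k, gamma * x_ss],
)
print(C[0, 0])  # variance equals the mean here, as expected for a Poisson state
```

For this one-species example the variance equals the mean, which is a quick sanity check that the Lyapunov solve is set up consistently.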

Journal ArticleDOI
TL;DR: A rice genome view of homologous wheat genome locations based on comparative sequence analysis revealed numerous chromosomal rearrangements that will significantly complicate the use of rice as a model for cross-species transfer of information in nonconserved regions.
Abstract: The use of DNA sequence-based comparative genomics for evolutionary studies and for transferring information from model species to crop species has revolutionized molecular genetics and crop improvement strategies. This study compared 4485 expressed sequence tags (ESTs) that were physically mapped in wheat chromosome bins, to the public rice genome sequence data from 2251 ordered BAC/PAC clones using BLAST. A rice genome view of homologous wheat genome locations based on comparative sequence analysis revealed numerous chromosomal rearrangements that will significantly complicate the use of rice as a model for cross-species transfer of information in nonconserved regions.

Journal ArticleDOI
TL;DR: The authors propose a novel multilocus measure of LD, the chromosome segment homozygosity (CSH), a valuable statistic for inferring population histories from haplotype data that also has implications for mapping of disease loci.
Abstract: Linkage disequilibrium (LD) between densely spaced, polymorphic genetic markers in humans and other species contains information about historical population size. Inferring past population size is of interest both from an evolutionary perspective (e.g., testing the "out of Africa" hypothesis of human evolution) and to improve models for mapping of disease and quantitative trait genes. We propose a novel multilocus measure of LD, the chromosome segment homozygosity (CSH). CSH is defined for a specific chromosome segment, up to the full length of the chromosome. In computer simulations CSH was generally less variable than the r² measure of LD, and variability of CSH decreased as the number of markers in the chromosome segment was increased. The essence and utility of our novel measure is that CSH over long distances reflects recent effective population size (N), whereas CSH over small distances reflects the effective size in the more distant past. We illustrate the utility of CSH by calculating CSH from human and dairy cattle SNP and microsatellite marker data, and predicting N at various times in the past for each species. Results indicated an exponentially increasing N in humans and a declining N in dairy cattle. CSH is a valuable statistic for inferring population histories from haplotype data, and has implications for mapping of disease loci.
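
The abstract uses the pairwise r² statistic as the baseline against which CSH is compared. As a point of reference only, here is a minimal sketch of the usual two-locus r² computed from phased biallelic haplotypes. It does not implement CSH, whose formula the abstract does not give, and the function name and toy data are hypothetical.

```python
def ld_r_squared(haplotypes):
    """Pairwise r^2 between two biallelic loci.

    haplotypes: list of (a, b) tuples with alleles coded 0/1 at locus A and B.
    Uses r^2 = D^2 / (pA (1 - pA) pB (1 - pB)) with D = pAB - pA * pB.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom

# Toy data: complete association versus no association.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_r_squared(linked), ld_r_squared(unlinked))
```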

Journal ArticleDOI
TL;DR: Overall, processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides. Their chromosomal distribution appears random but varies with GC content: processed pseudogenes occur mostly in intermediate GC-content regions.
Abstract: Processed pseudogenes were created by reverse-transcription of mRNAs; they provide snapshots of ancient genes existing millions of years ago in the genome. To find them in the present-day human, we developed a pipeline using features such as intron-absence, frame-disruption, polyadenylation, and truncation. This has enabled us to identify in recent genome drafts approximately 8000 processed pseudogenes (distributed from http://pseudogene.org). Overall, processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides. Their chromosomal distribution appears random and dispersed, with the numbers on chromosomes proportional to length, suggesting sustained "bombardment" over evolution. However, it does vary with GC-content: Processed pseudogenes occur mostly in intermediate GC-content regions. This is similar to Alus but contrasts with functional genes and L1-repeats. Pseudogenes, moreover, have age profiles similar to Alus. The number of pseudogenes associated with a given gene follows a power-law relationship, with a few genes giving rise to many pseudogenes and most giving rise to few. The prevalence of processed pseudogenes agrees well with germ-line gene expression. Highly expressed ribosomal proteins account for approximately 20% of the total. Other notables include cyclophilin-A, keratin, GAPDH, and cytochrome c.

Journal ArticleDOI
TL;DR: In this article, the authors derived a parsimonious scenario of gene losses for eukaryotic orthologous groups (KOGs) from seven complete eukaryotic genomes and introduced a numerical measure, the propensity for gene loss (PGL).
Abstract: Lineage-specific gene loss, to a large extent, accounts for the differences in gene repertoires between genomes, particularly among eukaryotes. We derived a parsimonious scenario of gene losses for eukaryotic orthologous groups (KOGs) from seven complete eukaryotic genomes. The scenario involves substantial gene loss in fungi, nematodes, and insects. Based on this evolutionary scenario and estimates of the divergence times between major eukaryotic phyla, we introduce a numerical measure, the propensity for gene loss (PGL). We explore the connection among the propensity of a gene to be lost in evolution (PGL value), protein sequence divergence, the effect of gene knockout on fitness, the number of protein-protein interactions, and expression level for the genes in KOGs. Significant correlations between PGL and each of these variables were detected. Genes that have a lower propensity to be lost in eukaryotic evolution accumulate fewer substitutions in their protein sequences and tend to be essential for the organism viability, tend to be highly expressed, and have many interaction partners. The dependence between PGL and gene dispensability and interactivity is much stronger than that for sequence evolution rate. Thus, propensity of a gene to be lost during evolution seems to be a direct reflection of its biological importance.

Journal ArticleDOI
TL;DR: The SPDI collection should facilitate efforts to better understand intercellular communication, may lead to new understandings of human diseases, and provides potential opportunities for the development of therapeutics.
Abstract: A large-scale effort, termed the Secreted Protein Discovery Initiative (SPDI), was undertaken to identify novel secreted and transmembrane proteins. In the first of several approaches, a biological signal sequence trap in yeast cells was utilized to identify cDNA clones encoding putative secreted proteins. A second strategy utilized various algorithms that recognize features such as the hydrophobic properties of signal sequences to identify putative proteins encoded by expressed sequence tags (ESTs) from human cDNA libraries. A third approach surveyed ESTs for protein sequence similarity to a set of known receptors and their ligands with the BLAST algorithm. Finally, both signal-sequence prediction algorithms and BLAST were used to identify single exons of potential genes from within human genomic sequence. The isolation of full-length cDNA clones for each of these candidate genes resulted in the identification of >1000 novel proteins. A total of 256 of these cDNAs are still novel, including variants and novel genes, per the most recent GenBank release version. The success of this large-scale effort was assessed by a bioinformatics analysis of the proteins through predictions of protein domains, subcellular localizations, and possible functional roles. The SPDI collection should facilitate efforts to better understand intercellular communication, may lead to new understandings of human diseases, and provides potential opportunities for the development of therapeutics.

Journal ArticleDOI
TL;DR: A high-throughput genotyping platform is developed by hybridizing genomic DNA from Arabidopsis thaliana accessions to an RNA expression GeneChip (AtGenome1), and it is demonstrated that array hybridization can be combined with bulk segregant analysis to quickly map mutations.
Abstract: We have developed a high-throughput genotyping platform by hybridizing genomic DNA from Arabidopsis thaliana accessions to an RNA expression GeneChip (AtGenome1). Using newly developed analytical tools, a large number of single-feature polymorphisms (SFPs) were identified. A comparison of two accessions, the reference strain Columbia (Col) and the strain Landsberg erecta (Ler), identified nearly 4000 SFPs, which could be reliably scored at a 5% error rate. Ler sequence was used to confirm 117 of 121 SFPs and to determine the sensitivity of array hybridization. Features containing sequence repeats, as well as those from high copy genes, showed greater polymorphism rates. A linear clustering algorithm was developed to identify clusters of SFPs representing potential deletions in 111 genes at a 5% false discovery rate (FDR). Among the potential deletions were transposons, disease resistance genes, and genes involved in secondary metabolism. The applicability of this technique was demonstrated by genotyping a recombinant inbred line. Recombination break points could be clearly defined, and in one case delimited to an interval of 29 kb. We further demonstrate that array hybridization can be combined with bulk segregant analysis to quickly map mutations. The extension of these tools to organisms with complex genomes, such as Arabidopsis, will greatly increase our ability to map and clone quantitative trait loci (QTL).

Journal ArticleDOI
TL;DR: Subgenic-resolution oligonucleotide microarrays were used to study global RNA degradation in wild-type Escherichia coli MG1655 and found a weak but highly significant correlation between the degradation of adjacent operon regions, suggesting that stability is determined by a combination of local and operon-wide stability determinants.
Abstract: Subgenic-resolution oligonucleotide microarrays were used to study global RNA degradation in wild-type Escherichia coli MG1655. RNA chemical half-lives were measured for 1036 open reading frames (ORFs) and for 329 known and predicted operons. The half-life of total mRNA was 6.8 min under the conditions tested. We also observed significant relationships between gene functional assignments and transcript stability. Unexpectedly, transcription of a single operon (tdcABCDEFG) was relatively rifampicin-insensitive and showed significant increases 2.5 min after rifampicin addition. This supports a novel mechanism of transcription for the tdc operon, whose promoter lacks any recognizable binding sites. Probe by probe analysis of all known and predicted operons showed that the 5′ ends of operons degrade, on average, more quickly than the rest of the transcript, with stability increasing in a 3′ direction, supporting and further generalizing the current model of a net 5′ to 3′ directionality of degradation. Hierarchical clustering analysis of operon degradation patterns revealed that this pattern predominates but is not exclusive. We found a weak but highly significant correlation between the degradation of adjacent operon regions, suggesting that stability is determined by a combination of local and operon-wide stability determinants. The 16 ORF dcw gene cluster, which has a complex promoter structure and a partially characterized degradation pattern, was studied at high resolution, allowing a detailed and integrated description of its abundance and degradation. We discuss the application of subgenic resolution DNA microarray analysis to study global mechanisms of RNA transcription and processing. Gene regulation is a dynamic process which can be controlled by a number of mechanisms as genetic information flows from nucleic acids to proteins. The study of gene regulation in the steady state, while informative, overlooks the underlying dynamics of the processes.
Steady-state transcript levels are a result of both RNA synthesis and degradation, and as such, measurements of degradation rates can be used to determine their rates of synthesis (if their steady-state levels are known) as well as reveal regulation which occurs via changes in RNA stability. For the genetic regulatory network of Escherichia coli to
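
The relationship between half-life, degradation rate, and synthesis rate described in this record can be made concrete with a small worked example. This is a minimal sketch that assumes simple first-order decay, an assumption the record does not state explicitly; the 6.8 min half-life comes from the abstract, while the function names and the steady-state level of 100 copies are hypothetical.

```python
import math

def decay_constant(half_life_min):
    """First-order decay constant k (per minute) from a half-life in minutes."""
    return math.log(2) / half_life_min

def fraction_remaining(t_min, half_life_min):
    """Fraction of transcript left after t minutes, assuming first-order decay."""
    return math.exp(-decay_constant(half_life_min) * t_min)

def synthesis_rate(steady_state_level, half_life_min):
    """At steady state, synthesis balances degradation: rate = k * level."""
    return decay_constant(half_life_min) * steady_state_level

k = decay_constant(6.8)              # total-mRNA half-life reported in the abstract
rate = synthesis_rate(100.0, 6.8)    # hypothetical steady-state level of 100 copies
print(k, rate, fraction_remaining(6.8, 6.8))
```

Under this model, a measured half-life plus a known steady-state level is enough to back out a synthesis rate, which is the point the text makes.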

Journal ArticleDOI
TL;DR: The results indicate that combining data mining using PexFinder with PVX-based functional assays can facilitate the discovery of novel pathogen effector proteins and can be applied to a variety of eukaryotic plant pathogens, including oomycetes, fungi, and nematodes.
Abstract: Interactions between plants and microbial pathogens involve complex signal exchanges at the plant surface and intercellular space interface (Baker et al. 1997; Parniske 2000; Hahn and Mendgen 2001). For example, plant pathogens have the remarkable ability to manipulate biochemical, physiological, and morphological processes in their host plants through a diverse array of extracellular effector molecules that can either promote infection or trigger defense responses (Knogge 1996; Lauge and De Wit 1998; Collmer et al. 2000; Kjemtrup et al. 2000; Staskawicz et al. 2001). Typically, such molecules are secreted into the intercellular interface between the pathogen and the plant or delivered inside the host cell to reach their cellular target. Thus, discovery programs that target genes encoding extracellular proteins can be expected to increase the probability of identifying genes involved in virulence. This approach has been taken successfully in the study of bacterial pathogens and symbionts. For example, an early study showed that Sinorhizobium meliloti mutants deficient in extracellular proteins were five times more likely to be affected in symbiosis than random mutants (Long et al. 1988). More recently, the characterization of effector proteins secreted through the type III secretion system of animal- and plant-associated bacteria has emerged as a key strategy for understanding mechanisms of virulence (Collmer et al. 2000; Kjemtrup et al. 2000; Staskawicz et al. 2001). In eukaryotic plant pathogens, genomic studies that focus systematically on extracellular proteins remain limited to nematodes, in which secretions from the esophageal gland cells are thought to play critical roles in infection (Wang et al. 2001). 
However, several classes of oomycete and fungal effector molecules, such as elicitor proteins that induce plant defense responses and a programmed cell death response termed the “hypersensitive response” (HR), are known to require secretion (Lauge and De Wit 1998; Jia et al. 2000). Therefore, secretion is an essential mechanism for delivery of virulence factors by eukaryotic plant pathogens to their appropriate site in infected plant tissue. In eukaryotic cells, most secreted and membrane proteins are exported through the general secretory pathway (also known as type II secretion system) via short, N-terminal amino-acid sequences known as signal peptides (von Heijne 1985; Rapoport 1992). Typically, signal peptides contain one or two charged amino acids followed by a hydrophobic core, and the signal peptidase cleavage site is defined by a pair of small uncharged amino acids (von Heijne 1985). Although most of these features can be identified in known extracellular proteins, the particular amino acid sequences are highly degenerate, and cannot be identified using DNA hybridization or PCR-based techniques (Klein et al. 1996). However, with the advent of genomics, large sets of sequence data became available, creating the opportunity to develop and test predictive software to identify extracellular proteins. For example, SignalP v2.0, a program that was developed using machine learning methods, assigns signal peptide prediction scores and putative cleavage sites to unknown amino acid sequences with a high level of accuracy (Nielsen et al. 1997; Nielsen and Krogh 1998; Menne et al. 2000). The Irish famine pathogen, Phytophthora infestans, is a eukaryotic oomycete microorganism that causes late blight, a worldwide devastating disease of potato and tomato (Fry and Goodwin 1997a,b). Although it is a pathogen of great economic importance, little is known about the molecular mechanisms involved in the pathogenicity and host specificity of P. infestans, and only a handful of genes have been implicated in interaction with host plants (Kamoun 2000, 2001). Structural genomics of Phytophthora is underway. Pilot cDNA sequencing projects were performed for P. infestans and another species, Phytophthora sojae (Kamoun et al. 1999b; Qutob et al. 2000), resulting in a database of expressed sequence tags (ESTs; Waugh et al. 2000). With the accumulation of sequence data for Phytophthora, the challenge is shifting to data mining and functional analyses. One important goal is to be able to associate a biological function with sequences with no significant similarity to known genes. With this objective in mind, we set out to identify systematically P. infestans cDNAs encoding extracellular proteins from EST databases. Here, we describe PexFinder, an algorithm for the automated identification of putative extracellular proteins from ESTs. We applied PexFinder to a P. infestans EST data set and selected 63 candidate Pex (Phytophthora extracellular proteins) cDNAs for functional expression in plants using a viral vector. This functional genomics strategy resulted in the discovery of a novel family of necrosis-inducing genes that are predicted to encode extracellular proteins with no similarity to sequences in public databases.
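
The signal-peptide features listed in this record (one or two charged residues followed by a hydrophobic core) can be illustrated with a deliberately crude filter. This is a toy sketch, not SignalP and not PexFinder: the residue sets, the thresholds, the function names, and the two test strings are all hypothetical choices made for illustration, and the cleavage-site rule is not modelled.

```python
CHARGED = set("KRDE")          # assumed set of charged residues
HYDROPHOBIC = set("AVLIFMWC")  # assumed set of hydrophobic residues

def longest_run(seq, allowed):
    """Length of the longest stretch of consecutive residues drawn from `allowed`."""
    best = cur = 0
    for ch in seq:
        cur = cur + 1 if ch in allowed else 0
        best = max(best, cur)
    return best

def has_signal_peptide_like_start(protein, n_len=6, window=30, min_core=7):
    """Crude check for a charged N-region followed by a hydrophobic core.

    n_len, window, and min_core are illustrative thresholds only.
    """
    seq = protein.upper()
    n_region = seq[1:n_len]                       # residues after the initial Met
    charged = sum(ch in CHARGED for ch in n_region)
    core = longest_run(seq[1:window], HYDROPHOBIC)
    return charged >= 1 and core >= min_core

# Illustrative strings: one with a charged start and a long hydrophobic stretch,
# one that is charged but lacks any hydrophobic core.
candidate = "MKKTAIAIAVALAGFATVAQA"
non_candidate = "MSTNPKPQRKTKRNTNRRPQDVKF"
print(has_signal_peptide_like_start(candidate),
      has_signal_peptide_like_start(non_candidate))
```

A real pipeline would rely on a trained predictor and its scores rather than fixed residue counts; the sketch only shows why the features described above are easy to state but too degenerate to match by exact sequence.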

Journal ArticleDOI
TL;DR: The genome of biotype 1 strain V. vulnificus YJ016, an etiologic agent of human mortality from seafood-borne infections, was sequenced and a super-integron (SI) was identified, and the SI region spans 139 kbp and contains 188 gene cassettes.
Abstract: The halophile Vibrio vulnificus is an etiologic agent of human mortality from seafood-borne infections. We applied whole-genome sequencing and comparative analysis to investigate the evolution of this pathogen. The genome of biotype 1 strain, V. vulnificus YJ016, was sequenced and includes two chromosomes of estimated 3377 kbp and 1857 kbp in size, and a plasmid of 48,508 bp. A super-integron (SI) was identified, and the SI region spans 139 kbp and contains 188 gene cassettes. In contrast to non-SI sequences, the captured gene cassettes are unique for any given Vibrio species and are highly variable among V. vulnificus strains. Multiple rearrangements were found when comparing the 5.3-Mbp V. vulnificus YJ016 genome and the 4.0-Mbp V. cholerae El Tor N16961 genome. The organization of gene clusters of capsular polysaccharide, iron metabolism, and RTX toxin showed distinct genetic features of V. vulnificus and V. cholerae. The content of the V. vulnificus genome contained gene duplications and evidence of horizontal transfer, allowing for genetic diversity and function in the marine environment. The genomic information obtained in this study can be applied to monitoring vibrio infections and identifying virulence genes in V. vulnificus.