
Showing papers in "BMC Genomics in 2008"


Journal ArticleDOI
TL;DR: A fully automated service for annotating bacterial and archaeal genomes that identifies protein-encoding, rRNA and tRNA genes, assigns functions to the genes, predicts which subsystems are represented in the genome, uses this information to reconstruct the metabolic network and makes the output easily downloadable for the user.
Abstract: The number of prokaryotic genome sequences becoming available is growing steadily and is growing faster than our ability to accurately annotate them. We describe a fully automated service for annotating bacterial and archaeal genomes. The service identifies protein-encoding, rRNA and tRNA genes, assigns functions to the genes, predicts which subsystems are represented in the genome, uses this information to reconstruct the metabolic network and makes the output easily downloadable for the user. In addition, the annotated genome can be browsed in an environment that supports comparative analysis with the annotated genomes maintained in the SEED environment. The service normally makes the annotated genome available within 12–24 hours of submission, but ultimately the quality of such a service will be judged in terms of accuracy, consistency, and completeness of the produced annotations. We summarize our attempts to address these issues and discuss plans for incrementally enhancing the service. By providing accurate, rapid annotation freely to the community we have created an important community resource. The service has now been utilized by over 120 external users annotating over 350 distinct genomes.

9,397 citations


Journal ArticleDOI
TL;DR: BioVenn is an easy-to-use web application to generate area-proportional Venn diagrams from lists of biological identifiers, which supports a wide range of identifiers from the most used biological databases currently available.
Abstract: In many genomics projects, numerous lists containing biological identifiers are produced. Often it is useful to see the overlap between different lists, enabling researchers to quickly observe similarities and differences between the data sets they are analyzing. One of the most popular methods to visualize the overlap and differences between data sets is the Venn diagram: a diagram consisting of two or more circles in which each circle corresponds to a data set, and the overlap between the circles corresponds to the overlap between the data sets. Venn diagrams are especially useful when they are 'area-proportional' i.e. the sizes of the circles and the overlaps correspond to the sizes of the data sets. Currently there are no programs available that can create area-proportional Venn diagrams connected to a wide range of biological databases. We designed a web application named BioVenn to summarize the overlap between two or three lists of identifiers, using area-proportional Venn diagrams. The user only needs to input these lists of identifiers in the textboxes and push the submit button. Parameters like colors and text size can be adjusted easily through the web interface. The position of the text can be adjusted by 'drag-and-drop' principle. The output Venn diagram can be shown as an SVG or PNG image embedded in the web application, or as a standalone SVG or PNG image. The latter option is useful for batch queries. Besides the Venn diagram, BioVenn outputs lists of identifiers for each of the resulting subsets. If an identifier is recognized as belonging to one of the supported biological databases, the output is linked to that database. Finally, BioVenn can map Affymetrix and EntrezGene identifiers to Ensembl genes. BioVenn is an easy-to-use web application to generate area-proportional Venn diagrams from lists of biological identifiers. It supports a wide range of identifiers from the most used biological databases currently available. 
Its implementation on the World Wide Web makes it available for use on any computer with internet connection, independent of operating system and without the need to install programs locally. BioVenn is freely accessible at http://www.cmbi.ru.nl/cdd/biovenn/ .
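The overlap computation that such a diagram summarizes can be sketched with plain set operations. This is a minimal illustration, not BioVenn's actual implementation; the function name `venn_regions` and the toy identifiers are assumptions made for the example.

```python
def venn_regions(x, y, z):
    """Split three identifier lists into the seven disjoint Venn regions.

    Hypothetical helper for illustration; not part of BioVenn.
    """
    x, y, z = set(x), set(y), set(z)
    return {
        "x_only": x - y - z,
        "y_only": y - x - z,
        "z_only": z - x - y,
        "xy_only": (x & y) - z,
        "xz_only": (x & z) - y,
        "yz_only": (y & z) - x,
        "xyz": x & y & z,
    }


# Toy identifier lists (made up for the example).
regions = venn_regions(["A", "B", "C", "D"], ["C", "D", "E"], ["D", "E", "F"])

# Region sizes are what an area-proportional diagram would be scaled to.
sizes = {name: len(members) for name, members in regions.items()}
print(sizes)
```

Because the seven regions are disjoint, their sizes sum to the size of the union of the three lists, which is a convenient sanity check before drawing.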

1,323 citations


Journal ArticleDOI
TL;DR: A genome-wide analysis of LEA proteins and their encoding genes in Arabidopsis thaliana indicates a wide range of sequence diversity, intracellular localizations, and expression patterns and indicates that they confer an evolutionary advantage for an organism under varying stressful environmental conditions.
Abstract: LEA (late embryogenesis abundant) proteins were first described about 25 years ago as accumulating late in plant seed development. They were later found in vegetative plant tissues following environmental stress and also in desiccation tolerant bacteria and invertebrates. Although they are widely assumed to play crucial roles in cellular dehydration tolerance, their physiological and biochemical functions are largely unknown. We present a genome-wide analysis of LEA proteins and their encoding genes in Arabidopsis thaliana. We identified 51 LEA protein encoding genes in the Arabidopsis genome that could be classified into nine distinct groups. Expression studies were performed on all genes at different developmental stages, in different plant organs and under different stress and hormone treatments using quantitative RT-PCR. We found evidence of expression for all 51 genes. There was little overlap between genes expressed in vegetative tissues and in seeds, and expression levels were generally higher in seeds. Most genes encoding LEA proteins had abscisic acid response (ABRE) and/or low temperature response (LTRE) elements in their promoters and many genes containing the respective promoter elements were induced by abscisic acid, cold or drought. We also found that 33% of all Arabidopsis LEA protein encoding genes are arranged in tandem repeats and that 43% are part of homeologous pairs. The majority of LEA proteins were predicted to be highly hydrophilic and natively unstructured, but some were predicted to be folded. The analyses indicate a wide range of sequence diversity, intracellular localizations, and expression patterns. The high fraction of retained duplicate genes and the inferred functional diversification indicate that they confer an evolutionary advantage for an organism under varying stressful environmental conditions.
This comprehensive analysis will be an important starting point for future efforts to elucidate the functional role of these enigmatic proteins.

838 citations


Journal ArticleDOI
TL;DR: The goal is to review the key discoveries and to weave these discoveries together to support novel approaches for understanding sequence-function relationships.
Abstract: Our first predictor of protein disorder was published just over a decade ago in the Proceedings of the IEEE International Conference on Neural Networks (Romero P, Obradovic Z, Kissinger C, Villafranca JE, Dunker AK (1997) Identifying disordered regions in proteins from amino acid sequence. Proceedings of the IEEE International Conference on Neural Networks, 1: 90–95). By now more than twenty other laboratory groups have joined the efforts to improve the prediction of protein disorder. While the various prediction methodologies used for protein intrinsic disorder resemble those methodologies used for secondary structure prediction, the two types of structures are entirely different. For example, the two structural classes have very different dynamic properties, with the irregular secondary structure class being much less mobile than the disorder class. The prediction of secondary structure has been useful. On the other hand, the prediction of intrinsic disorder has been revolutionary, leading to major modifications of the more than 100 year-old views relating protein structure and function. Experimentalists have been providing evidence over many decades that some proteins lack fixed structure or are disordered (or unfolded) under physiological conditions. In addition, experimentalists are also showing that, for many proteins, their functions depend on the unstructured rather than structured state; such results are in marked contrast to the greater than hundred year old views such as the lock and key hypothesis. Despite extensive data on many important examples, including disease-associated proteins, the importance of disorder for protein function has been largely ignored. Indeed, to our knowledge, current biochemistry books don't present even one acknowledged example of a disorder-dependent function, even though some reports of disorder-dependent functions are more than 50 years old. 
The results from genome-wide predictions of intrinsic disorder and the results from other bioinformatics studies of intrinsic disorder are demanding attention for these proteins. Disorder prediction has been important for showing that the relatively few experimentally characterized examples are members of a very large collection of related disordered proteins that are wide-spread over all three domains of life. Many significant biological functions are now known to depend directly on, or are importantly associated with, the unfolded or partially folded state. Here our goal is to review the key discoveries and to weave these discoveries together to support novel approaches for understanding sequence-function relationships. Intrinsically disordered protein is common across the three domains of life, but especially common among the eukaryotic proteomes. Signaling sequences and sites of posttranslational modifications are frequently, or very likely most often, located within regions of intrinsic disorder. Disorder-to-order transitions are coupled with the adoption of different structures with different partners. Also, the flexibility of intrinsic disorder helps different disordered regions to bind to a common binding site on a common partner. Such capacity for binding diversity plays important roles in both protein-protein interaction networks and likely also in gene regulation networks. Such disorder-based signaling is further modulated in multicellular eukaryotes by alternative splicing, for which such splicing events map to regions of disorder much more often than to regions of structure. Associating alternative splicing with disorder rather than structure alleviates theoretical and experimentally observed problems associated with the folding of different length, isomeric amino acid sequences. 
The combination of disorder and alternative splicing is proposed to provide a mechanism for easily "trying out" different signaling pathways, thereby providing the mechanism for generating signaling diversity and enabling the evolution of cell differentiation and multicellularity. Finally, several recent small molecules of interest as potential drugs have been shown to act by blocking protein-protein interactions based on intrinsic disorder of one of the partners. Study of these examples has led to a new approach for drug discovery, and bioinformatics analysis of the human proteome suggests that various disease-associated proteins are very rich in such disorder-based drug discovery targets.

643 citations


Journal ArticleDOI
TL;DR: Results challenge the proposal that SREBF1 is central for milk fat synthesis regulation and highlight a pivotal role for a concerted action among PPARG, PPARGC1A, and INSIG1.
Abstract: The molecular events associated with regulation of milk fat synthesis in the bovine mammary gland remain largely unknown. Our objective was to study mammary tissue mRNA expression via quantitative PCR of 45 genes associated with lipid synthesis (triacylglycerol and phospholipids) and secretion from the late pre-partum/non-lactating period through the end of subsequent lactation. mRNA expression was coupled with milk fatty acid (FA) composition and calculated indexes of FA desaturation and de novo synthesis by the mammary gland. Marked up-regulation and/or % relative mRNA abundance during lactation were observed for genes associated with mammary FA uptake from blood (LPL, CD36), intracellular FA trafficking (FABP3), long-chain (ACSL1) and short-chain (ACSS2) intracellular FA activation, de novo FA synthesis (ACACA, FASN), desaturation (SCD, FADS1), triacylglycerol synthesis (AGPAT6, GPAM, LPIN1), lipid droplet formation (BTN1A1, XDH), ketone body utilization (BDH1), and transcription regulation (INSIG1, PPARG, PPARGC1A). Change in SREBF1 mRNA expression during lactation, thought to be central for milk fat synthesis regulation, was ≤2-fold in magnitude, while expression of INSIG1, which negatively regulates SREBP activation, was >12-fold and had a parallel pattern of expression to PPARGC1A. Genes involved in phospholipid synthesis had moderate up-regulation in expression and % relative mRNA abundance. The mRNA abundance and up-regulation in expression of ABCG2 during lactation was markedly high, suggesting a biological role of this gene in milk synthesis/secretion. Weak correlations were observed between both milk FA composition and desaturase indexes (i.e., apparent SCD activity) with mRNA expression pattern of genes measured. A network of genes participates in coordinating milk fat synthesis and secretion. 
Results challenge the proposal that SREBF1 is central for milk fat synthesis regulation and highlight a pivotal role for a concerted action among PPARG, PPARGC1A, and INSIG1. Expression of SCD, the most abundant gene measured, appears to be key during milk fat synthesis. The lack of correlation between gene expression and calculated desaturase indexes does not support their use to infer mRNA expression or enzyme activity (e.g., SCD). Longitudinal mRNA expression allowed development of transcriptional regulation networks and an updated model of milk fat synthesis regulation.

634 citations


Journal ArticleDOI
TL;DR: A novel approach based on a much shorter barcode sequence is established and demonstrated its effectiveness in archival specimens, which will significantly broaden the application of DNA barcoding in biodiversity studies.
Abstract: The goal of DNA barcoding is to develop a species-specific sequence library for all eukaryotes. A 650 bp fragment of the cytochrome c oxidase 1 (CO1) gene has been used successfully for species-level identification in several animal groups. It may be difficult in practice, however, to retrieve a 650 bp fragment from archival specimens, (because of DNA degradation) or from environmental samples (where universal primers are needed). We used a bioinformatics analysis using all CO1 barcode sequences from GenBank and calculated the probability of having species-specific barcodes for varied size fragments. This analysis established the potential of much smaller fragments, mini-barcodes, for identifying unknown specimens. We then developed a universal primer set for the amplification of mini-barcodes. We further successfully tested the utility of this primer set on a comprehensive set of taxa from all major eukaryotic groups as well as archival specimens. In this study we address the important issue of minimum amount of sequence information required for identifying species in DNA barcoding. We establish a novel approach based on a much shorter barcode sequence and demonstrate its effectiveness in archival specimens. This approach will significantly broaden the application of DNA barcoding in biodiversity studies.

586 citations


Journal ArticleDOI
TL;DR: Detailed examination of two divergent examples of hub proteins supports the conjecture that hub proteins often utilize intrinsic disorder to bind to multiple partners and provides detailed information about induced fit in structured regions.
Abstract: Proteins are involved in many interactions with other proteins leading to networks that regulate and control a wide variety of physiological processes. Some of these proteins, called hub proteins or hubs, bind to many different protein partners. Protein intrinsic disorder, via diversity arising from structural plasticity or flexibility, provides a means for hubs to associate with many partners (Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN: Flexible Nets: The roles of intrinsic disorder in protein interaction networks. FEBS J 2005, 272:5129-5148). Here we present a detailed examination of two divergent examples: 1) p53, which uses different disordered regions to bind to different partners and which also has several individual disordered regions that each bind to multiple partners, and 2) 14-3-3, which is a structured protein that associates with many different intrinsically disordered partners. For both examples, three-dimensional structures of multiple complexes reveal that the flexibility and plasticity of intrinsically disordered protein regions as well as induced-fit changes in the structured regions are both important for binding diversity. These data support the conjecture that hub proteins often utilize intrinsic disorder to bind to multiple partners and provide detailed information about induced fit in structured regions.

582 citations


Journal ArticleDOI
TL;DR: Bioinformatics analysis provides a valuable platform for gene discovery and functional prediction that helps explain the activity of A. ferrooxidans in industrial bioleaching and its role as a primary producer in acidic environments.
Abstract: Acidithiobacillus ferrooxidans is a major participant in consortia of microorganisms used for the industrial recovery of copper (bioleaching or biomining). It is a chemolithoautotrophic γ-proteobacterium using energy from the oxidation of iron- and sulfur-containing minerals for growth. It thrives at extremely low pH (pH 1–2) and fixes both carbon and nitrogen from the atmosphere. It solubilizes copper and other metals from rocks and plays an important role in nutrient and metal biogeochemical cycling in acid environments. The lack of a well-developed system for genetic manipulation has prevented thorough exploration of its physiology. Also, confusion has been caused by prior metabolic models constructed based upon the examination of multiple, and sometimes distantly related, strains of the microorganism. The genome of the type strain A. ferrooxidans ATCC 23270 was sequenced and annotated to identify general features and provide a framework for in silico metabolic reconstruction. Earlier models of iron and sulfur oxidation, biofilm formation, quorum sensing, inorganic ion uptake, and amino acid metabolism are confirmed and extended. Initial models are presented for central carbon metabolism, anaerobic metabolism (including sulfur reduction, hydrogen metabolism and nitrogen fixation), stress responses, DNA repair, and metal and toxic compound fluxes. Bioinformatics analysis provides a valuable platform for gene discovery and functional prediction that helps explain the activity of A. ferrooxidans in industrial bioleaching and its role as a primary producer in acidic environments. An analysis of the genome of the type strain provides a coherent view of its gene content and metabolic potential.

489 citations


Journal ArticleDOI
TL;DR: Abundance measurements for more than 1000 E. coli proteins presented in this work represent the most complete study of protein abundance in a bacterial cell so far and show significant associations between the abundance of a protein and its properties and functions in the cell.
Abstract: Knowledge about the abundance of molecular components is an important prerequisite for building quantitative predictive models of cellular behavior. Proteins are central components of these models, since they carry out most of the fundamental processes in the cell. Thus far, protein concentrations have been difficult to measure on a large scale, but proteomic technologies have now advanced to a stage where this information becomes readily accessible. Here, we describe an experimental scheme to maximize the coverage of proteins identified by mass spectrometry of a complex biological sample. Using a combination of LC-MS/MS approaches with protein and peptide fractionation steps we identified 1103 proteins from the cytosolic fraction of the Escherichia coli strain MC4100. A measure of abundance is presented for each of the identified proteins, based on the recently developed emPAI approach which takes into account the number of sequenced peptides per protein. The values of abundance are within a broad range and accurately reflect independently measured copy numbers per cell. As expected, the most abundant proteins were those involved in protein synthesis, most notably ribosomal proteins. Proteins involved in energy metabolism as well as those with binding function were also found in high copy number while proteins annotated with the terms metabolism, transcription, transport, and cellular organization were rare. The barrel-sandwich fold was found to be the structural fold with the highest abundance. Highly abundant proteins are predicted to be less prone to aggregation based on their length, pI values, and occurrence patterns of hydrophobic stretches. We also find that abundant proteins tend to be predominantly essential. Additionally we observe a significant correlation between protein and mRNA abundance in E. coli cells. Abundance measurements for more than 1000 E. coli proteins presented in this work represent the most complete study of protein abundance in a bacterial cell so far. We show significant associations between the abundance of a protein and its properties and functions in the cell. In this way, we provide both data and novel insights into the role of protein concentration in this model organism.
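For orientation, emPAI is commonly defined as 10 raised to the ratio of observed to observable peptides, minus one, and relative molar content is obtained by normalizing each value to the sum over all proteins. The sketch below assumes that standard definition; the function names and the peptide counts are hypothetical, and the abstract does not state the exact variant the authors applied.

```python
def empai(n_observed, n_observable):
    """emPAI = 10 ** (observed peptides / observable peptides) - 1."""
    if n_observable <= 0:
        raise ValueError("n_observable must be positive")
    return 10 ** (n_observed / n_observable) - 1


def molar_percent(empai_by_protein):
    """Normalize emPAI values to percentages of the total."""
    total = sum(empai_by_protein.values())
    if total == 0:
        raise ValueError("all emPAI values are zero")
    return {name: 100 * value / total for name, value in empai_by_protein.items()}


# Hypothetical peptide counts: (observed, observable) per protein.
counts = {"protA": (10, 10), "protB": (5, 10), "protC": (0, 8)}
scores = {name: empai(obs, possible) for name, (obs, possible) in counts.items()}
shares = molar_percent(scores)
print(scores)
print(shares)
```

The exponential form means that doubling peptide coverage raises the abundance estimate by far more than a factor of two, which is how the index spans the broad abundance range noted in the abstract.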

484 citations


Journal ArticleDOI
TL;DR: It is demonstrated that SNPs sampled in large-scale with 454 pyrosequencing can be used to detect evolutionary signatures among genes, providing one of the first genome-wide assessments of nucleotide diversity and Ka/Ks for a non-model plant species.
Abstract: Benefits from high-throughput sequencing using 454 pyrosequencing technology may be most apparent for species with high societal or economic value but few genomic resources. Rapid means of gene sequence and SNP discovery using this novel sequencing technology provide a set of baseline tools for genome-level research. However, it is questionable how effectively the sequencing of large numbers of short reads for species with essentially no prior gene sequence information will support contig assemblies and sequence annotation. With the purpose of generating the first broad survey of gene sequences in Eucalyptus grandis, the most widely planted hardwood tree species, we used 454 technology to sequence and assemble 148 Mbp of expressed sequences (EST). EST sequences were generated from a normalized cDNA pool comprised of multiple tissues and genotypes, promoting discovery of homologues to almost half of Arabidopsis genes, and a comprehensive survey of allelic variation in the transcriptome. By aligning the sequencing reads from multiple genotypes we detected 23,742 SNPs, 83% of which were validated in a sample. Genome-wide nucleotide diversity was estimated for 2,392 contigs using a modified theta (θ) parameter, adapted for measuring genetic diversity from polymorphisms detected by randomly sequencing a multi-genotype cDNA pool. Diversity estimates in non-synonymous nucleotides were on average 4x smaller than in synonymous, suggesting purifying selection. Non-synonymous to synonymous substitutions (Ka/Ks) among 2,001 contigs averaged 0.30 and was skewed to the right, further supporting that most genes are under purifying selection. Comparison of these estimates among contigs identified major functional classes of genes under purifying and diversifying selection in agreement with previous research.
In providing an abundance of foundational transcript sequences where limited prior genomic information existed, this work created part of the foundation for the annotation of the E. grandis genome that is being sequenced by the US Department of Energy. In addition we demonstrated that SNPs sampled in large-scale with 454 pyrosequencing can be used to detect evolutionary signatures among genes, providing one of the first genome-wide assessments of nucleotide diversity and Ka/Ks for a non-model plant species.

476 citations


Journal ArticleDOI
TL;DR: In vivo characterization of differentially-expressed products in gonads demonstrates that Angiotensin Converting Enzyme varies between Wolbachia infected and uninfected flies and that the variation occurs in a sex-specific manner, which supports the use of Wolbachia infected cell cultures as an appropriate model for predicting in vivo host/Wolbachia interactions.
Abstract: Intracellular Wolbachia bacteria are obligate, maternally-inherited, endosymbionts found frequently in insects and other invertebrates. The success of Wolbachia can be attributed in part to an ability to alter host reproduction via mechanisms including cytoplasmic incompatibility (CI), parthenogenesis, feminization and male killing. Despite substantial scientific effort, the molecular mechanisms underlying the Wolbachia/host interaction are unknown. Here, an in vitro Wolbachia infection was generated in the Drosophila S2 cell line, and transcription profiles of infected and uninfected cells were compared by microarray. Differentially-expressed patterns related to reproduction, immune response and heat stress response are observed, including multiple genes that have been previously reported to be involved in the Wolbachia/host interaction. Subsequent in vivo characterization of differentially-expressed products in gonads demonstrates that Angiotensin Converting Enzyme (Ance) varies between Wolbachia infected and uninfected flies and that the variation occurs in a sex-specific manner. Consistent with expectations for the conserved CI mechanism, the observed Ance expression pattern is repeatable in different Drosophila species and with different Wolbachia types. To examine Ance involvement in the CI phenotype, compatible and incompatible crosses of Ance mutant flies were conducted. Significant differences are observed in the egg hatch rate resulting from incompatible crosses, providing support for additional experiments examining for an interaction of Ance with the CI mechanism. Wolbachia infection is shown to affect the expression of multiple host genes, including Ance. Evidence for potential Ance involvement in the CI mechanism is described, including the prior report of Ance in spermatid differentiation, Wolbachia-induced sex-specific effects on Ance expression and an Ance mutation effect on CI levels. 
The results support the use of Wolbachia infected cell cultures as an appropriate model for predicting in vivo host/Wolbachia interactions.

Journal ArticleDOI
TL;DR: A gene classifier that can predict clinical outcome in tamoxifen-treated ER+ BC patients is developed and other genes and pathways that may elucidate further mechanisms that influence clinical outcome and prediction of response to tamoxifen are proposed.
Abstract: Estrogen receptor positive (ER+) breast cancers (BC) are heterogeneous with regard to their clinical behavior and response to therapies. The ER is currently the best predictor of response to the anti-estrogen agent tamoxifen, yet up to 30–40% of ER+BC will relapse despite tamoxifen treatment. New prognostic biomarkers and further biological understanding of tamoxifen resistance are required. We used gene expression profiling to develop an outcome-based predictor using a training set of 255 ER+ BC samples from women treated with adjuvant tamoxifen monotherapy. We used clusters of highly correlated genes to develop our predictor to facilitate both signature stability and biological interpretation. Independent validation was performed using 362 tamoxifen-treated ER+ BC samples obtained from multiple institutions and treated with tamoxifen only in the adjuvant and metastatic settings. We developed a gene classifier consisting of 181 genes belonging to 13 biological clusters. In the independent set of adjuvantly-treated samples, it was able to define two distinct prognostic groups (HR 2.01 95%CI: 1.29–3.13; p = 0.002). Six of the 13 gene clusters represented pathways involved in cell cycle and proliferation. In 112 metastatic breast cancer patients treated with tamoxifen, one of the classifier components suggesting a cellular inflammatory mechanism was significantly predictive of response. We have developed a gene classifier that can predict clinical outcome in tamoxifen-treated ER+ BC patients. Whilst our study emphasizes the important role of proliferation genes in prognosis, our approach proposes other genes and pathways that may elucidate further mechanisms that influence clinical outcome and prediction of response to tamoxifen.

Journal ArticleDOI
TL;DR: The complete genome sequence of strain PXO99A is reported on and its comparison to two previously sequenced strains, KACC10331 and MAFF311018, which are highly similar to one another and point to sources of genomic variation and candidates for strain-specific adaptations of this pathogen.
Abstract: Xanthomonas oryzae pv. oryzae causes bacterial blight of rice (Oryza sativa L.), a major disease that constrains production of this staple crop in many parts of the world. We report here on the complete genome sequence of strain PXO99A and its comparison to two previously sequenced strains, KACC10331 and MAFF311018, which are highly similar to one another. The PXO99A genome is a single circular chromosome of 5,240,075 bp, considerably longer than the genomes of the other strains (4,941,439 bp and 4,940,217 bp, respectively), and it contains 5083 protein-coding genes, including 87 not found in KACC10331 or MAFF311018. PXO99A contains a greater number of virulence-associated transcription activator-like effector genes and has at least ten major chromosomal rearrangements relative to KACC10331 and MAFF311018. PXO99A contains numerous copies of diverse insertion sequence elements, members of which are associated with 7 out of 10 of the major rearrangements. A rapidly-evolving CRISPR (clustered regularly interspersed short palindromic repeats) region contains evidence of dozens of phage infections unique to the PXO99A lineage. PXO99A also contains a unique, near-perfect tandem repeat of 212 kilobases close to the replication terminus. Our results provide striking evidence of genome plasticity and rapid evolution within Xanthomonas oryzae pv. oryzae. The comparisons point to sources of genomic variation and candidates for strain-specific adaptations of this pathogen that help to explain the extraordinary diversity of Xanthomonas oryzae pv. oryzae genotypes and races that have been isolated from around the world.

Journal ArticleDOI
TL;DR: Construction and analysis of a small RNA library led to the identification of 20 conserved and 35 novel miRNA families in soybean and enable investigation of the role of miRNAs in rhizobial symbiosis.
Abstract: Small RNAs regulate a number of developmental processes in plants and animals. However, the role of small RNAs in legume-rhizobial symbiosis is largely unexplored. Symbiosis between legumes (e.g. soybean) and rhizobia bacteria (e.g. Bradyrhizobium japonicum) results in root nodules where the majority of biological nitrogen fixation occurs. We sought to identify microRNAs (miRNAs) regulated during soybean-B. japonicum symbiosis. We sequenced ~350000 small RNAs from soybean roots inoculated with B. japonicum and identified conserved miRNAs based on similarity to miRNAs known in other plant species and new miRNAs based on potential hairpin-forming precursors within soybean EST and shotgun genomic sequences. These bioinformatics analyses identified 55 families of miRNAs of which 35 were novel. A subset of these miRNAs were validated by Northern analysis and miRNAs differentially responding to B. japonicum inoculation were identified. We also identified putative target genes of the identified miRNAs and verified in vivo cleavage of a subset of these targets by 5'-RACE analysis. Using conserved miRNAs as internal control, we estimated that our analysis identified ~50% of miRNAs in soybean roots. Construction and analysis of a small RNA library led to the identification of 20 conserved and 35 novel miRNA families in soybean. The availability of complete and assembled genome sequence information will enable identification of many other miRNAs. The conserved miRNA loci and novel miRNAs identified in this study enable investigation of the role of miRNAs in rhizobial symbiosis.

Journal ArticleDOI
TL;DR: This study provides insight into the adaptive evolution of the mtDNA genome in mammals and its implications for the molecular mechanism of oxidative phosphorylation, and presents a framework for future experimental characterization of the impact of specific mutations in the function, physiology, and interactions of the mtDNA encoded proteins involved in oxidative phosphorylation.
Abstract: The mitochondria produce up to 95% of a eukaryotic cell's energy through oxidative phosphorylation. The proteins involved in this vital process are under high functional constraints. However, metabolic requirements vary across species, potentially modifying selective pressures. We evaluate the adaptive evolution of 12 protein-coding mitochondrial genes in 41 placental mammalian species by assessing amino acid sequence variation and exploring the functional implications of observed variation in secondary and tertiary protein structures. Wide variation in the properties of amino acids was observed at functionally important regions of cytochrome b in species with more-specialized metabolic requirements (such as adaptation to low energy diet or large body size, such as in elephant, dugong, sloth, and pangolin, and adaptation to unusual oxygen requirements, for example diving in cetaceans, flying in bats, and living at high altitudes in alpacas). Signatures of adaptive variation in the NADH dehydrogenase complex were restricted to the loop regions of the transmembrane units which likely function as proton pumps. Evidence of adaptive variation in the cytochrome c oxidase complex was observed mostly at the interface between the mitochondrial and nuclear-encoded subunits, perhaps evidence of co-evolution. The ATP8 subunit, which has an important role in the assembly of F0, exhibited the highest signal of adaptive variation. ATP6, which has an essential role in rotor performance, showed a high adaptive variation in predicted loop areas. Our study provides insight into the adaptive evolution of the mtDNA genome in mammals and its implications for the molecular mechanism of oxidative phosphorylation. We present a framework for future experimental characterization of the impact of specific mutations in the function, physiology, and interactions of the mtDNA encoded proteins involved in oxidative phosphorylation.

Journal ArticleDOI
TL;DR: It is demonstrated that human bacteremia strains distribute over the entire span of E. coli phylogenetic diversity and that CCs represent important phylogenetic units for pathogenesis and comparative genomics.
Abstract: Extraintestinal pathogenic Escherichia coli (ExPEC) strains represent a huge public health burden. Knowledge of their clonal diversity and of the association of clones with genomic content and clinical features is a prerequisite to recognize strains with a high invasive potential. In order to provide an unbiased view of the diversity of E. coli strains responsible for bacteremia, we studied 161 consecutive isolates from patients with positive blood culture obtained during one year in two French university hospitals. We collected precise clinical information, multilocus sequence typing (MLST) data and virulence gene content for all isolates. A subset representative of the clonal diversity was subjected to comparative genomic hybridization (CGH) using 2,324 amplicons from the flexible gene pool of E. coli. Recombination-insensitive phylogenetic analysis of MLST data in combination with the ECOR collection revealed that bacteremic E. coli isolates were highly diverse and distributed into five major lineages, corresponding to the classical E. coli phylogroups (A+B1, B2, D and E) and group F, which comprises strains previously assigned to D. Compared to other strains of phylogenetic group B2, strains belonging to MLST-derived clonal complexes (CCs) CC1 and CC4 were associated (P < 0.05) with a urinary origin. In contrast, no CC appeared associated with severe sepsis or unfavorable outcome of the bacteremia. CGH analysis revealed genomic characteristics of the distinct CCs and identified genomic regions associated with CC1 and/or CC4. Our results demonstrate that human bacteremia strains distribute over the entire span of E. coli phylogenetic diversity and that CCs represent important phylogenetic units for pathogenesis and comparative genomics.

Journal ArticleDOI
TL;DR: This study provided a transcriptomic signature for OTSCC that may lead to a diagnosis or screen tool and provide the foundation for further functional validation of these specific candidate genes for OTSCC.
Abstract: The head and neck/oral squamous cell carcinoma (HNOSCC) is a diverse group of cancers, which develop from many different anatomic sites and are associated with different risk factors and genetic characteristics. The oral tongue squamous cell carcinoma (OTSCC) is one of the most common types of HNOSCC. It is significantly more aggressive than other forms of HNOSCC, in terms of local invasion and spread. In this study, we aim to identify specific transcriptomic signatures that are associated with OTSCC. Genome-wide transcriptomic profiles were obtained for 53 primary OTSCCs and 22 matching normal tissues. Genes that exhibit statistically significant differences in expression between OTSCCs and normal were identified. These include up-regulated genes (MMP1, MMP10, MMP3, MMP12, PTHLH, INHBA, LAMC2, IL8, KRT17, COL1A2, IFI6, ISG15, PLAU, GREM1, MMP9, IFI44, CXCL1), and down-regulated genes (KRT4, MAL, CRNN, SCEL, CRISP3, SPINK5, CLCA4, ADH1B, P11, TGM3, RHCG, PPP1R3C, CEACAM7, HPGD, CFD, ABCA8, CLU, CYP3A5). The expression differences of IL8 and MMP9 were further validated by real-time quantitative RT-PCR and immunohistochemistry. The Gene Ontology analysis suggested a number of altered biological processes in OTSCCs, including enhancements in phosphate transport, collagen catabolism, I-kappaB kinase/NF-kappaB signaling cascade, extracellular matrix organization and biogenesis, chemotaxis, as well as suppressions of superoxide release, hydrogen peroxide metabolism, cellular response to hydrogen peroxide, keratinization, and keratinocyte differentiation in OTSCCs. In summary, our study provided a transcriptomic signature for OTSCC that may lead to a diagnosis or screen tool and provide the foundation for further functional validation of these specific candidate genes for OTSCC.

Journal ArticleDOI
TL;DR: The discovery that mosquitoes collected from different types of breeding sites display differing profiles of metabolic genes at the adult stage may reflect the influence of a range of xenobiotics on selecting for resistance in mosquitoes.
Abstract: Insecticide resistance in Anopheles mosquitoes is threatening the success of malaria control programmes. This is particularly true in Benin where pyrethroid resistance has been linked to the failure of insecticide treated bed nets. The role of mutations in the insecticide target sites in conferring resistance has been clearly established. In this study, the contribution of other potential resistance mechanisms was investigated in Anopheles gambiae s.s. from a number of localities in Southern Benin and Nigeria. The mosquitoes were sampled from a variety of breeding sites in a preliminary attempt to investigate the role of contamination of mosquito breeding sites in selecting for resistance in adult mosquitoes. All mosquitoes sampled belonged to the M form of An. gambiae s.s. There were high levels of permethrin resistance in an agricultural area (Akron) and an urban area (Gbedjromede), low levels of resistance in mosquito samples from an oil contaminated site (Ojoo) and complete susceptibility in the rural Orogun location. The target site mutation kdrW was detected at high levels in two of the populations (Akron f = 0.86 and Gbedjromede f = 0.84) but was not detected in Ojoo or Orogun. Microarray analysis using the Anopheles gambiae detox chip identified two P450s, CYP6P3 and CYP6M2 up regulated in all three populations, the former was expressed at particularly high levels in the Akron (12.4-fold) and Ojoo (7.4-fold) populations compared to the susceptible population. Additional detoxification and redox genes were also over expressed in one or more populations including two cuticular pre-cursor genes which were elevated in two of the three resistant populations. Multiple resistance mechanisms incurred in the different breeding sites contribute to resistance to permethrin in Benin. The cytochrome P450 genes, CYP6P3 and CYP6M2 are upregulated in all three resistant populations analysed. 
Several additional potential resistance mechanisms were also identified that warrant further investigation. Metabolic genes were over expressed irrespective of the presence of kdr, the latter resistance mechanism being absent in one resistant population. The discovery that mosquitoes collected from different types of breeding sites display differing profiles of metabolic genes at the adult stage may reflect the influence of a range of xenobiotics on selecting for resistance in mosquitoes.

Journal ArticleDOI
TL;DR: Comparative phylogenetic analysis of vertebrate TLR genes provides insight into their patterns and processes of gene evolution, with examples of both gene gain and gene loss.
Abstract: Toll-like receptors (TLRs) perform a vital role in disease resistance through their recognition of pathogen associated molecular patterns (PAMPs). Recent advances in genomics allow comparison of TLR genes within and between many species. This study takes advantage of the recently sequenced chicken genome to determine the complete chicken TLR repertoire and place it in context of vertebrate genomic evolution. The chicken TLR repertoire consists of ten genes. Phylogenetic analyses show that six of these genes have orthologs in mammals and fish, while one is only shared by fish and three appear to be unique to birds. Furthermore the phylogeny shows that TLR1-like genes arose independently in fish, birds and mammals from an ancestral gene also shared by TLR6 and TLR10. All other TLRs were already present prior to the divergence of major vertebrate lineages 550 Mya (million years ago) and have since been lost in certain lineages. Phylogenetic analysis shows the absence of TLRs 8 and 9 in chicken to be the result of gene loss. The notable exception to the tendency of gene loss in TLR evolution is found in chicken TLRs 1 and 2, each of which underwent gene duplication about 147 and 65 Mya, respectively. Comparative phylogenetic analysis of vertebrate TLR genes provides insight into their patterns and processes of gene evolution, with examples of both gene gain and gene loss. In addition, these comparisons clarify the nomenclature of TLR genes in vertebrates.

Journal ArticleDOI
TL;DR: The data shows that well-characterized non-coding RNA, such as tRNA, snoRNA, and snRNA are cleaved at sites specific to the class of ncRNA, indicating that the small RNAs are a product of dsRNA formation and their subsequent cleavage.
Abstract: Small RNA attracts increasing interest based on the discovery of RNA silencing and the rapid progress of our understanding of these phenomena. Although recent studies suggest the possible existence of yet undiscovered types of small RNAs in higher organisms, many studies to profile small RNA have focused on miRNA and/or siRNA rather than on the exploration of additional classes of RNAs. Here, we explored human small RNAs by unbiased sequencing of RNAs with sizes of 19–40 nt. We provide substantial evidence for the existence of independent classes of small RNAs. Our data shows that well-characterized non-coding RNA, such as tRNA, snoRNA, and snRNA are cleaved at sites specific to the class of ncRNA. In particular, tRNA cleavage is regulated depending on tRNA type and tissue expression. We also found small RNAs mapped to genomic regions that are transcribed in both directions by bidirectional promoters, indicating that the small RNAs are a product of dsRNA formation and their subsequent cleavage. Their partial similarity with ribosomal RNAs (rRNAs) suggests unrevealed functions of ribosomal DNA or interstitial rRNA. Further examination revealed six novel miRNAs. Our results underscore the complexity of the small RNA world and the biogenesis of small RNAs.

Journal ArticleDOI
TL;DR: The SISPA methodology is adapted to genome sequencing of RNA and DNA viruses and of great utility in generating whole genome assemblies for viruses with little or no available sequence information, viruses from greatly divergent families, previously uncharacterized viruses, or to more fully describe mixed viral infections.
Abstract: Most emerging health threats are of zoonotic origin. For the overwhelming majority, their causative agents are RNA viruses which include but are not limited to HIV, Influenza, SARS, Ebola, Dengue, and Hantavirus. Of increasing importance therefore is a better understanding of global viral diversity to enable better surveillance and prediction of pandemic threats; this will require rapid and flexible methods for complete viral genome sequencing. We have adapted the SISPA methodology [1–3] to genome sequencing of RNA and DNA viruses. We have demonstrated the utility of the method on various types and sources of viruses, obtaining near complete genome sequence of viruses ranging in size from 3,000–15,000 bp with a median depth of coverage of 14.33. We used this technique to generate full viral genome sequence in the presence of host contaminants, using viral preparations from cell culture supernatant, allantoic fluid and fecal matter. The method described is of great utility in generating whole genome assemblies for viruses with little or no available sequence information, viruses from greatly divergent families, previously uncharacterized viruses, or to more fully describe mixed viral infections.

Journal ArticleDOI
TL;DR: This comparative genome-wide overview of the PP2C family in Arabidopsis and rice provides insights into the functions and regulatory mechanisms, as well as the evolution and divergence of the PP2C genes in dicots and monocots.
Abstract: The protein phosphatase 2Cs (PP2Cs) from various organisms have been implicated to act as negative modulators of protein kinase pathways involved in diverse environmental stress responses and developmental processes. A genome-wide overview of the PP2C gene family in plants is not yet available. A comprehensive computational analysis identified 80 and 78 PP2C genes in Arabidopsis thaliana (AtPP2Cs) and Oryza sativa (OsPP2Cs), respectively, which denotes the PP2C gene family as one of the largest families identified in plants. Phylogenetic analysis divided PP2Cs in Arabidopsis and rice into 13 and 11 subfamilies, respectively, which are supported by the analyses of gene structures and protein motifs. Comparative analysis between the PP2C genes in Arabidopsis and rice identified common and lineage-specific subfamilies and potential 'gene birth-and-death' events. Gene duplication analysis reveals that whole genome and chromosomal segment duplications mainly contributed to the expansion of both OsPP2Cs and AtPP2Cs, but tandem or local duplication occurred less frequently in Arabidopsis than rice. Some protein motifs are widespread among the PP2C proteins, whereas some other motifs are specific to only one or two subfamilies. Expression pattern analysis suggests that 1) most PP2C genes play functional roles in multiple tissues in both species, 2) the induced expression of most genes in subfamily A by diverse stimuli indicates their primary role in stress tolerance, especially ABA response, and 3) the expression pattern of subfamily D members suggests that they may constitute positive regulators in ABA-mediated signaling pathways. The analyses of putative upstream regulatory elements by two approaches further support the functions of subfamily A in ABA signaling, and provide insights into the shared and different transcriptional regulation machineries in dicots and monocots. 
This comparative genome-wide overview of the PP2C family in Arabidopsis and rice provides insights into the functions and regulatory mechanisms, as well as the evolution and divergence of the PP2C genes in dicots and monocots. Bioinformatics analyses suggest that plant PP2C proteins from different subfamilies participate in distinct signaling pathways. Our results have established a solid foundation for future studies on the functional divergence in different PP2C subfamilies.

Journal ArticleDOI
TL;DR: The heat stress responsive genes identified in this study will facilitate the understanding of molecular basis for heat tolerance in different wheat genotypes and future improvement of heat tolerance in wheat and other cereals.
Abstract: Wheat is a major crop in the world, and the high temperature stress can reduce the yield of wheat by as much as 15%. The molecular changes in response to heat stress are poorly understood. Using GeneChip® Wheat Genome Array, we analyzed genome-wide gene expression profiles in the leaves of two wheat genotypes, namely, heat susceptible 'Chinese Spring' (CS) and heat tolerant 'TAM107' (TAM). A total of 6560 (~10.7%) probe sets displayed 2-fold or more changes in expression in at least one heat treatment (false discovery rate, FDR, α = 0.001). Except for heat shock protein (HSP) and heat shock factor (HSF) genes, these putative heat responsive genes encode transcription factors and proteins involved in phytohormone biosynthesis/signaling, calcium and sugar signal pathways, RNA metabolism, ribosomal proteins, primary and secondary metabolisms, as well as proteins related to other stresses. A total of 313 probe sets were differentially expressed between the two genotypes, which could be responsible for the difference in heat tolerance of the two genotypes. Moreover, 1314 were differentially expressed between the heat treatments with and without pre-acclimation, and 4533 were differentially expressed between short and prolonged heat treatments. The differences in heat tolerance in different wheat genotypes may be associated with multiple processes and mechanisms involving HSPs, transcription factors, and other stress related genes. Heat acclimation has little effect on gene expression under prolonged treatments but affects gene expression in wheat under short-term heat stress. The heat stress responsive genes identified in this study will facilitate our understanding of molecular basis for heat tolerance in different wheat genotypes and future improvement of heat tolerance in wheat and other cereals.

Journal ArticleDOI
TL;DR: A database-assisted system, PlantPAN (Plant Promoter Analysis Navigator), for recognizing combinatorial cis-regulatory elements with a distance constraint in sets of plant genes and enables other regulatory features in a plant promoter, such as CpG/CpNpG islands and tandem repeats, to be displayed.
Abstract: The elucidation of transcriptional regulation in plant genes is an important area of research for plant scientists, following the mapping of various plant genomes, such as A. thaliana, O. sativa and Z. mays. A variety of bioinformatic servers or databases of plant promoters have been established, although most have been focused only on annotating transcription factor binding sites in a single gene and have neglected some important regulatory elements (tandem repeats and CpG/CpNpG islands) in promoter regions. Additionally, the combinatorial interaction of transcription factors (TFs) is important in regulating the gene group that is associated with the same expression pattern. Therefore, a tool for detecting the co-regulation of transcription factors in a group of gene promoters is required. This study develops a database-assisted system, PlantPAN (Plant Promoter Analysis Navigator), for recognizing combinatorial cis-regulatory elements with a distance constraint in sets of plant genes. The system collects the plant transcription factor binding profiles from PLACE, TRANSFAC (public release 7.0), AGRIS, and JASPAR databases and allows users to input a group of gene IDs or promoter sequences, enabling the co-occurrence of combinatorial transcription factor binding sites (TFBSs) within a defined distance (20 bp to 200 bp) to be identified. Furthermore, the new resource enables other regulatory features in a plant promoter, such as CpG/CpNpG islands and tandem repeats, to be displayed. The regulatory elements in the conserved regions of the promoters across homologous genes are detected and presented. In addition to providing a user-friendly input/output interface, PlantPAN has numerous advantages in the analysis of a plant promoter. Several case studies have established the effectiveness of PlantPAN. This novel analytical resource is now freely available at http://PlantPAN.mbc.nctu.edu.tw .
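
The distance-constrained co-occurrence search described in this abstract can be illustrated with a minimal sketch. This is not PlantPAN's implementation: the function name `cooccurring_sites`, the input layout (a list of factor/position tuples for one promoter), and the example factor names and positions are all assumptions made for illustration. Only the 20 bp to 200 bp window comes from the abstract.

```python
from itertools import combinations


def cooccurring_sites(sites, min_dist=20, max_dist=200):
    """Return pairs of different factors whose sites lie min_dist..max_dist bp apart.

    `sites` is a list of (factor_name, position) tuples for a single promoter.
    This input layout is a hypothetical one chosen for the sketch.
    """
    ordered = sorted(sites, key=lambda site: site[1])
    pairs = []
    for (tf_a, pos_a), (tf_b, pos_b) in combinations(ordered, 2):
        distance = pos_b - pos_a  # non-negative because sites are sorted by position
        if tf_a != tf_b and min_dist <= distance <= max_dist:
            pairs.append((tf_a, tf_b, distance))
    return pairs


# Invented example sites; names and positions are illustrative only.
example_sites = [("ABRE", 100), ("DRE", 150), ("MYB", 400), ("ABRE", 410)]
print(cooccurring_sites(example_sites))
```

Running the same function over every promoter in a gene group and tallying the pairs that recur would give a rough picture of which factor combinations co-occur across the group.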

Journal ArticleDOI
TL;DR: Deep sequencing of short RNAs from M. truncatula leaves identified eight new miRNAs, indicating that specific miRNAs exist in legume species, as well as 26 novel miRNA candidates that were potentially generated from 32 loci.
Abstract: High-throughput sequencing technology is capable of identifying novel short RNAs in plant species. We used Solexa sequencing to find new microRNAs in one of the model legume species, barrel medic (Medicago truncatula). 3,948,871 reads were obtained from two separate short RNA libraries generated from total RNA extracted from M. truncatula leaves, representing 1,563,959 distinct sequences. 2,168,937 reads were mapped to the available M. truncatula genome corresponding to 619,175 distinct sequences. 174,504 reads representing 25 conserved miRNA families showed perfect matches to known miRNAs. We also identified 26 novel miRNA candidates that were potentially generated from 32 loci. Nine of these loci produced eight distinct sequences, for which the miRNA* sequences were also sequenced. These sequences were not described in other plant species and accumulation of these eight novel miRNAs was confirmed by Northern blot analysis. Potential target genes were predicted for most conserved and novel miRNAs. Deep sequencing of short RNAs from M. truncatula leaves identified eight new miRNAs indicating that specific miRNAs exist in legume species.

Journal ArticleDOI
TL;DR: This investigation has identified 23 rice genes belonging to DCL, Argonaute and RDR gene families that could potentially be involved in reproductive development-specific gene regulatory mechanisms and a basis for further, more detailed investigations aimed at understanding the contribution of individual components of RNA silencing machinery during reproductive phase of plant development.
Abstract: Important developmental processes in both plants and animals are partly regulated by genes whose expression is modulated at the post-transcriptional level by processes such as RNA interference (RNAi). Dicers, Argonautes and RNA-dependent RNA polymerases (RDR) form the core components that facilitate gene silencing and have been implicated in the initiation and maintenance of the trigger RNA molecules, central to the process of RNAi. Investigations in eukaryotes have revealed that these proteins are encoded by a variable number of genes with plants showing relatively higher number in each gene family. To date, no systematic expression profiling of these genes in any of the organisms has been reported. In this study, we provide a complete analysis of rice Dicer-like, Argonaute and RDR gene families including gene structure, genomic localization and phylogenetic relatedness among gene family members. We also present microarray-based expression profiling of these genes during 14 stages of reproductive and 5 stages of vegetative development and in response to cold, salt and dehydration stress. We have identified 8 Dicer-like (OsDCLs), 19 Argonaute (OsAGOs) and 5 RNA-dependent RNA polymerase (OsRDRs) genes in rice. Based on phylogeny, each of these gene families has been categorized into four subgroups. Although most of the genes express both in vegetative and reproductive organs, 2 OsDCLs, 14 OsAGOs and 3 OsRDRs were found to express specifically/preferentially during stages of reproductive development. Of these, 2 OsAGOs exhibited preferential up-regulation in seeds. One of the Argonautes (OsAGO2) also showed specific up-regulation in response to cold, salt and dehydration stress. This investigation has identified 23 rice genes belonging to DCL, Argonaute and RDR gene families that could potentially be involved in reproductive development-specific gene regulatory mechanisms. 
These data provide an insight into probable domains of activity of these genes and a basis for further, more detailed investigations aimed at understanding the contribution of individual components of RNA silencing machinery during reproductive phase of plant development.

Journal ArticleDOI
TL;DR: The genome sequence of A. salmonicida was determined to provide a better understanding of the virulence factors used by this pathogen to infect fish and provide insights into the mechanisms used by the bacterium for infection and avoidance of host defence systems.
Abstract: Aeromonas salmonicida subsp. salmonicida is a Gram-negative bacterium that is the causative agent of furunculosis, a bacterial septicaemia of salmonid fish. While other species of Aeromonas are opportunistic pathogens or are found in commensal or symbiotic relationships with animal hosts, A. salmonicida subsp. salmonicida causes disease in healthy fish. The genome sequence of A. salmonicida was determined to provide a better understanding of the virulence factors used by this pathogen to infect fish. The nucleotide sequences of the A. salmonicida subsp. salmonicida A449 chromosome and two large plasmids are characterized. The chromosome is 4,702,402 bp and encodes 4388 genes, while the two large plasmids are 166,749 and 155,098 bp with 178 and 164 genes, respectively. Notable features are a large inversion in the chromosome and, in one of the large plasmids, the presence of a Tn21 composite transposon containing mercury resistance genes and an In2 integron encoding genes for resistance to streptomycin/spectinomycin, quaternary ammonia compounds, sulphonamides and chloramphenicol. A large number of genes encoding potential virulence factors were identified; however, many appear to be pseudogenes since they contain insertion sequences, frameshifts or in-frame stop codons. A total of 170 pseudogenes and 88 insertion sequences (of ten different types) are found in the A. salmonicida genome. Comparison with the A. hydrophila ATCC 7966T genome reveals multiple large inversions in the chromosome as well as an approximately 9% difference in gene content indicating instances of single gene or operon loss or gain. A limited number of the pseudogenes found in A. salmonicida A449 were investigated in other Aeromonas strains and species. While nearly all the pseudogenes tested are present in A. salmonicida subsp. salmonicida strains, only about 25% were found in other A. salmonicida subspecies and none were detected in other Aeromonas species.
Relative to the A. hydrophila ATCC 7966T genome, the A. salmonicida subsp. salmonicida genome has acquired multiple mobile genetic elements, undergone substantial rearrangement and developed a significant number of pseudogenes. These changes appear to be a consequence of adaptation to a specific host, salmonid fish, and provide insights into the mechanisms used by the bacterium for infection and avoidance of host defence systems.

Journal ArticleDOI
TL;DR: The Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets, is introduced, based on enhanced suffix arrays that gives a much larger flexibility concerning the choice of the k-mers size.
Abstract: The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation, based on counting occurrences of k-mers, has been previously used to distinguish TEs from low-copy genic regions; but currently available software solutions are impractical due to high memory requirements or specialization for specific user-tasks. Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays. This gives a much larger flexibility concerning the choice of the k-mer size. Tallymer can process large data sizes of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10^9 bp). We analyzed k-mer frequencies for a wide range of k. At this low genome coverage (≈ 0.45×) highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible k-mers. Similar low-complexity was seen in the repeat fractions of sorghum and rice. When applying our method to other maize data sets, High-C0t derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class, and DNA transposons. Among expressed sequence tags (EST), a notable fraction contained high-copy k-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. 
Among one hundred annotated bacterial artificial chromosomes (BACs), k-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity. The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see http://www.zbh.uni-hamburg.de/Tallymer .
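
The k-mer frequency idea behind this abstract can be sketched in a few lines. The sketch below deliberately swaps in a plain hash-table counter (`collections.Counter`) for Tallymer's enhanced suffix arrays, so it shows the counting concept only and not Tallymer's memory-efficient method. The function names, the example reads, and the `min_count` threshold are hypothetical, and reverse complements are not merged.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer made only of A, C, G, T across the input sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
                counts[kmer] += 1
    return counts


def repetitive_fraction(sequences, k, min_count):
    """Fraction of k-mer occurrences belonging to k-mers seen at least min_count times."""
    counts = count_kmers(sequences, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c >= min_count)
    return repeated / total


# Invented toy reads, far smaller than any real shotgun data set.
reads = ["ACGTACGTAC", "TTGACGTACG"]
kmer_counts = count_kmers(reads, 4)
print(kmer_counts.most_common(3))
print(repetitive_fraction(reads, 4, min_count=2))
```

A dictionary of this kind grows with the number of distinct k-mers, which is the memory problem the abstract attributes to earlier tools on inputs of several billion bases.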

Journal ArticleDOI
TL;DR: An experimentally validated collection of murine primer pairs for PCR and QPCR which can be used under a common PCR thermal profile, allowing the evaluation of transcript abundance of a large number of genes in parallel.
Abstract: Quantitative polymerase chain reaction (QPCR) is a widely applied analytical method for the accurate determination of transcript abundance. Primers for QPCR have been designed on a genomic scale but non-specific amplification of non-target genes has frequently been a problem. Although several online databases have been created for the storage and retrieval of experimentally validated primers, only a few thousand primer pairs are currently present in existing databases and the primers are not designed for use under a common PCR thermal profile. We previously reported the implementation of an algorithm to predict PCR primers for most known human and mouse genes. We now report the use of that resource to identify 17483 pairs of primers that have been experimentally verified to amplify unique sequences corresponding to distinct murine transcripts. The primer pairs have been validated by gel electrophoresis, DNA sequence analysis and thermal denaturation profile. In addition to the validation studies, we have determined the uniformity of amplification using the primers and the technical reproducibility of the QPCR reaction using the popular and inexpensive SYBR Green I detection method. We have identified an experimentally validated collection of murine primer pairs for PCR and QPCR which can be used under a common PCR thermal profile, allowing the evaluation of transcript abundance of a large number of genes in parallel. This feature is increasingly attractive for confirming and/or making more precise data trends observed from experiments performed with DNA microarrays.

Journal ArticleDOI
TL;DR: Reveals the importance of feature selection in accurately classifying new samples and shows how an integrated feature selection and classification algorithm performs and identifies significant genes.
Abstract: Several classification and feature selection methods have been studied for the identification of differentially expressed genes in microarray data. Classification methods such as SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest methods have been used in recent studies. The accuracy of these methods has been calculated with validation methods such as v-fold validation. However, there is a lack of comparison between these methods to find a better framework for classification, clustering and analysis of microarray gene expression results. In this study, we compared the efficiency of classification methods including SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest methods. The v-fold cross validation was used to calculate the accuracy of the classifiers. Some of the common clustering methods including K-means, DBC, and EM clustering were applied to the datasets and the efficiency of these methods has been analysed. Further, the efficiency of the feature selection methods including support vector machine recursive feature elimination (SVM-RFE), Chi Squared, and CSF was compared. In each case these methods were applied to eight different binary (two class) microarray datasets. We evaluated the class prediction efficiency of each gene list in training and test cross-validation using supervised classifiers. We presented a study in which we compared some of the commonly used classification, clustering, and feature selection methods. We applied these methods to eight publicly available datasets, and compared how these methods performed in class prediction of test datasets. We reported that the choice of feature selection methods, the number of genes in the gene list, and the number of cases (samples) substantially influence classification success. Based on features chosen by these methods, error rates and accuracy of several classification algorithms were obtained. Results revealed the importance of feature selection in accurately classifying new samples and showed how an integrated feature selection and classification algorithm performs and can identify significant genes.
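
The v-fold cross validation mentioned in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the interleaved fold assignment, and the toy one-feature threshold classifier are all hypothetical. The sketch passes training in as a callable so that any feature selection step sees only the training folds, which is common practice but is not something the abstract confirms.

```python
def v_fold_splits(n_samples, v):
    """Yield (train_indices, test_indices) for v roughly equal folds."""
    # Interleaved assignment: sample i goes to fold i % v (an assumption).
    folds = [list(range(i, n_samples, v)) for i in range(v)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test


def cross_validated_accuracy(samples, labels, v, train_fn):
    """Fraction of samples predicted correctly when each is held out once.

    train_fn(train_samples, train_labels) must return a predict(sample)
    callable; feature selection, if any, belongs inside train_fn.
    """
    correct = 0
    for train, test in v_fold_splits(len(samples), v):
        predict = train_fn([samples[i] for i in train],
                           [labels[i] for i in train])
        correct += sum(predict(samples[i]) == labels[i] for i in test)
    return correct / len(samples)


def train_threshold(train_samples, train_labels):
    """Toy binary classifier: split at the midpoint of the two class means."""
    class0 = [x for x, y in zip(train_samples, train_labels) if y == 0]
    class1 = [x for x, y in zip(train_samples, train_labels) if y == 1]
    midpoint = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2
    return lambda x: int(x >= midpoint)
```

A call such as `cross_validated_accuracy(values, labels, 3, train_threshold)` then returns the share of held-out samples classified correctly; a real comparison would swap `train_threshold` for each classifier and feature selection method under study.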