
Showing papers on "Pseudogene published in 2018"


Journal ArticleDOI
TL;DR: Recent integrative analyses have provided evidence that new computational platforms and experimental approaches can be harnessed together to distinguish key ceRNA interactions in specific cancers, which could facilitate the identification of robust biomarkers and therapeutic targets, and hence, more effective cancer therapies and better patient outcome and survival.
Abstract: Noncoding RNAs (ncRNAs) constitute the majority of the human transcribed genome. This largest class of RNA transcripts plays diverse roles in a multitude of cellular processes, and has been implicated in many pathological conditions, especially cancer. The different subclasses of ncRNAs include microRNAs, a class of short ncRNAs; and a variety of long ncRNAs (lncRNAs), such as lincRNAs, antisense RNAs, pseudogenes, and circular RNAs. Many studies have demonstrated the involvement of these ncRNAs in competitive regulatory interactions, known as competing endogenous RNA (ceRNA) networks, whereby lncRNAs can act as microRNA decoys to modulate gene expression. These interactions are often interconnected, thus aberrant expression of any network component could derail the complex regulatory circuitry, culminating in cancer development and progression. Recent integrative analyses have provided evidence that new computational platforms and experimental approaches can be harnessed together to distinguish key ceRNA interactions in specific cancers, which could facilitate the identification of robust biomarkers and therapeutic targets, and hence, more effective cancer therapies and better patient outcome and survival.

755 citations


Journal ArticleDOI
TL;DR: An integrated proteogenomics analysis workflow (IPAW) is presented that combines peptide discovery, curation and validation, robustly identifying unknown coding regions and mutations.
Abstract: Proteogenomics enables the discovery of novel peptides (from unannotated genomic protein-coding loci) and single amino acid variant peptides (derived from single-nucleotide polymorphisms and mutations). Increasing the reliability of these identifications is crucial to ensure their usefulness for genome annotation and potential application as neoantigens in cancer immunotherapy. We here present integrated proteogenomics analysis workflow (IPAW), which combines peptide discovery, curation, and validation. IPAW includes the SpectrumAI tool for automated inspection of MS/MS spectra, eliminating false identifications of single-residue substitution peptides. We employ IPAW to analyze two proteomics data sets acquired from A431 cells and five normal human tissues using extended (pH range, 3–10) high-resolution isoelectric focusing (HiRIEF) pre-fractionation and TMT-based peptide quantitation. The IPAW results provide evidence for the translation of pseudogenes, lncRNAs, short ORFs, alternative ORFs, N-terminal extensions, and intronic sequences. Moreover, our quantitative analysis indicates that protein production from certain pseudogenes and lncRNAs is tissue specific. Proteogenomics enables the discovery of protein coding regions and disease-relevant mutations but their verification remains challenging. Here, the authors combine peptide discovery, curation and validation in an integrated proteogenomics workflow, robustly identifying unknown coding regions and mutations.

104 citations


Journal ArticleDOI
TL;DR: The genome sequence of Streptococcus pneumoniae strain D39 is unambiguously determined and several inversions previously undetected by short-read sequencing are revealed, showing that the pneumococcal transcriptional landscape is complex and includes many secondary, antisense and internal promoters.
Abstract: A precise understanding of the genomic organization into transcriptional units and their regulation is essential for our comprehension of opportunistic human pathogens and how they cause disease. Using single-molecule real-time (PacBio) sequencing we unambiguously determined the genome sequence of Streptococcus pneumoniae strain D39 and revealed several inversions previously undetected by short-read sequencing. Significantly, a chromosomal inversion results in antigenic variation of PhtD, an important surface-exposed virulence factor. We generated a new genome annotation using automated tools, followed by manual curation, reflecting the current knowledge in the field. By combining sequence-driven terminator prediction, deep paired-end transcriptome sequencing and enrichment of primary transcripts by Cappable-Seq, we mapped 1015 transcriptional start sites and 748 termination sites. We show that the pneumococcal transcriptional landscape is complex and includes many secondary, antisense and internal promoters. Using this new genomic map, we identified several new small RNAs (sRNAs), RNA switches (including sixteen previously misidentified as sRNAs), and antisense RNAs. In total, we annotated 89 new protein-encoding genes, 34 sRNAs and 165 pseudogenes, bringing the S. pneumoniae D39 repertoire to 2146 genetic elements. We report operon structures and observed that 9% of operons are leaderless. The genome data are accessible in an online resource called PneumoBrowse (https://veeninglab.com/pneumobrowse) providing one of the most complete inventories of a bacterial genome to date. PneumoBrowse will accelerate pneumococcal research and the development of new prevention and treatment strategies.

91 citations


Journal ArticleDOI
TL;DR: This comparative study provides a comprehensive insight into the evolution of avian TLR genetic variability and identified candidate positions in the receptors that have been likely shaped by direct molecular host–pathogen coevolutionary interactions and most probably play key functional roles in birds.
Abstract: Toll-like receptors (TLRs) are key sensor molecules in vertebrates triggering initial phases of immune responses to pathogens. The avian TLR family typically consists of ten receptors, each adapted to distinct ligands. To understand the complex evolutionary history of each avian TLR, we analyzed all members of the TLR family in the whole genome assemblies and target sequence data of 63 bird species covering all major avian clades. Our results indicate that gene duplication events most probably occurred in TLR1 before synapsids diversified from sauropsids. Unlike mammals, ssRNA-recognizing TLR7 has duplicated independently in several avian taxa, while flagellin-sensing TLR5 has pseudogenized multiple times in bird phylogeny. Our analysis revealed stronger positive, diversifying selection acting in TLR5 and the three-domain TLRs (TLR10 [TLR1A], TLR1 [TLR1B], TLR2A, TLR2B, TLR4) that face the extracellular space and bind complex ligands than in single-domain TLR15 and endosomal TLRs (TLR3, TLR7, TLR21). In total, 84 out of 306 positively selected sites were predicted to harbor substitutions dramatically changing the amino acid physicochemical properties. Furthermore, 105 positively selected sites were located in the known functionally relevant TLR regions. We found evidence for convergent evolution acting between birds and mammals at 54 of these sites. Our comparative study provides a comprehensive insight into the evolution of avian TLR genetic variability. Besides describing the history of avian TLR gene gain and gene loss, we also identified candidate positions in the receptors that have been likely shaped by direct molecular host-pathogen coevolutionary interactions and most probably play key functional roles in birds.

89 citations


Journal ArticleDOI
25 Apr 2018-PLOS ONE
TL;DR: By comparing the bittersweet plastid genome with all available Solanaceae sequences it was found that gene content and synteny are highly conserved across the family.
Abstract: Bittersweet (Solanum dulcamara) is a native Old World member of the nightshade family. This European diploid species can be found from marshlands to high mountainous regions and it is a common weed that serves as an alternative host and source of resistance genes against plant pathogens such as late blight (Phytophthora infestans). We sequenced the complete chloroplast genome of bittersweet, which is 155,580 bp in length and is characterized by a typical quadripartite structure composed of a large (85,901 bp) and small (18,449 bp) single-copy region interspersed by two identical inverted repeats (25,615 bp). It consists of 112 unique genes, of which 81 are protein-coding, 27 are tRNA and four are rRNA genes. All bittersweet plastid genes including non-functional ones and even intergenic spacer regions are transcribed in primary plastid transcripts covering 95.22% of the genome. These are later substantially edited in a post-transcriptional phase to activate gene functions. By comparing the bittersweet plastid genome with all available Solanaceae sequences we found that gene content and synteny are highly conserved across the family. During genome comparison we identified several annotation errors, which we corrected by manual curation, and then identified the major plastid genome structural changes in Solanaceae. Interpreted in a phylogenetic context they seem to provide additional support for larger clades. The plastid genome sequence of bittersweet could help to benchmark Solanaceae plastid genome annotations and could be used as a reference for further studies. Such reliable annotations are important for gene diversity calculations, synteny map constructions and assigning partitions for phylogenetic analysis with de novo sequenced plastomes of Solanaceae.

79 citations


Journal ArticleDOI
01 Aug 2018-Cancers
TL;DR: The latest developments in pseudogene research are discussed, focusing on how pseudogenes impact tumorigenesis through different gene regulation mechanisms; the challenges that high sequence homology with the corresponding parent genes poses for pseudogene research are also discussed.
Abstract: Functional genomics has provided evidence that the human genome transcribes a large number of non-coding genes in addition to protein-coding genes, including microRNAs and long non-coding RNAs (lncRNAs). Among the lncRNAs are pseudogenes, which have received little attention in the past compared with other lncRNAs. However, increasing evidence points to the important role of pseudogenes in diverse cellular functions, and dysregulation of pseudogenes is often associated with various human diseases, including cancer. Like other types of lncRNAs, pseudogenes can also function as master regulators of gene expression and thus can play a critical role in various aspects of tumorigenesis. In this review we discuss the latest developments in pseudogene research, focusing on how pseudogenes impact tumorigenesis through different gene regulation mechanisms. Given the high sequence homology with the corresponding parent genes, we also discuss challenges for pseudogene research.

70 citations


Journal ArticleDOI
TL;DR: An extensive gene:pseudogene network comprising multiple miRNAs and multiple pseudogenes derived from a single parental gene could be regulated through multiple mechanisms to modulate iron storage in various signaling pathways, the deregulation of which results in PCa development and progression.
Abstract: Non-coding RNAs play a vital role in diverse cellular processes. Pseudogenes, which are non-coding homologs of protein-coding genes, were once considered non-functional evolutional relics. However, recent studies have shown that pseudogene transcripts can regulate their parental transcripts by sequestering shared microRNAs (miRNAs), thus acting as competing endogenous RNAs (ceRNAs). In this study, we utilize an unbiased screen to identify the ferritin heavy chain 1 (FTH1) transcript and multiple FTH1 pseudogenes as targets of several oncogenic miRNAs in prostate cancer (PCa). We characterize the critical role of this FTH1 gene:pseudogene:miRNA network in regulating tumorigenesis in PCa, whereby oncogenic miRNAs downregulate the expression of FTH1 and its pseudogenes to drive oncogenesis. We further show that impairing miRNA binding and subsequent ceRNA crosstalk completely rescues the slow growth phenotype in vitro and in vivo. Our results also demonstrate the reciprocal regulation between the pseudogenes and intracellular iron levels, which are crucial for multiple physiological and pathophysiological processes. In summary, we describe an extensive gene:pseudogene network comprising multiple miRNAs and multiple pseudogenes derived from a single parental gene. The network could be regulated through multiple mechanisms to modulate iron storage in various signaling pathways, the deregulation of which results in PCa development and progression.

65 citations


Book ChapterDOI
TL;DR: In this article, the association between oncogenes and miRNAs-pseudogenes was reviewed and determined in human cancer by the CellMiner web-tool.
Abstract: Our understanding of cancer pathways has been changed by the determination of noncoding transcripts in the human genome in recent years. miRNAs and pseudogenes are key players among the noncoding transcripts of the genome, and alterations of their expression levels provide clues to significant biomarkers in the pathogenesis of diseases. In particular, miRNAs and pseudogenes have both oncogenic and tumor-suppressive roles at each step of tumorigenesis. In this study, the association between oncogenes and miRNAs-pseudogenes in human cancer was reviewed and determined using the CellMiner web tool.

60 citations


Journal ArticleDOI
TL;DR: An in-depth investigation on the 2764 genes classified as coding by one or more sets of manual curators and not coding by others suggests that most are not under protein-like purifying selection and so are unlikely to code for functional proteins.
Abstract: Seventeen years after the sequencing of the human genome, the human proteome is still under revision. One in eight of the 22 210 coding genes listed by the Ensembl/GENCODE, RefSeq and UniProtKB reference databases are annotated differently across the three sets. We have carried out an in-depth investigation on the 2764 genes classified as coding by one or more sets of manual curators and not coding by others. Data from large-scale genetic variation analyses suggests that most are not under protein-like purifying selection and so are unlikely to code for functional proteins. A further 1470 genes annotated as coding in all three reference sets have characteristics that are typical of non-coding genes or pseudogenes. These potential non-coding genes also appear to be undergoing neutral evolution and have considerably less supporting transcript and protein evidence than other coding genes. We believe that the three reference databases currently overestimate the number of human coding genes by at least 2000, complicating and adding noise to large-scale biomedical experiments. Determining which potential non-coding genes do not code for proteins is a difficult but vitally important task since the human reference proteome is a fundamental pillar of most basic research and supports almost all large-scale biomedical projects.

53 citations


Journal ArticleDOI
TL;DR: An approach combining long-read sequencing and hybridization capture is developed to yield full-length transcript information and confidently distinguish between nearly identical genes/paralogs, significantly improving gene annotation.
Abstract: Despite the importance of duplicate genes for evolutionary adaptation, accurate gene annotation is often incomplete, incorrect, or lacking in regions of segmental duplication. We developed an approach combining long-read sequencing and hybridization capture to yield full-length transcript information and confidently distinguish between nearly identical genes/paralogs. We used biotinylated probes to enrich for full-length cDNA from duplicated regions, which were then amplified, size-fractionated, and sequenced using single-molecule, long-read sequencing technology, permitting us to distinguish between highly identical genes by virtue of multiple paralogous sequence variants. We examined 19 gene families as expressed in developing and adult human brain, selected for their high sequence identity (average >99%) and overlap with human-specific segmental duplications (SDs). We characterized the transcriptional differences between related paralogs to better understand the birth-death process of duplicate genes and particularly how the process leads to gene innovation. In 48% of the cases, we find that the expressed duplicates have changed substantially from their ancestral models due to novel sites of transcription initiation, splicing, and polyadenylation, as well as fusion transcripts that connect duplication-derived exons with neighboring genes. We detect unannotated open reading frames in genes currently annotated as pseudogenes, while relegating other duplicates to nonfunctional status. Our method significantly improves gene annotation, specifically defining full-length transcripts, isoforms, and open reading frames for new genes in highly identical SDs. The approach will be more broadly applicable to genes in structurally complex regions of other genomes where the duplication process creates novel genes important for adaptive traits.

51 citations


Journal ArticleDOI
TL;DR: This database will provide insights into the transcriptional regulation, expression, functions and mechanisms of pseudogenes as well as their roles in biological processes and diseases.
Abstract: Although thousands of pseudogenes have been annotated in the human genome, their transcriptional regulation, expression profiles and functional mechanisms are largely unknown. In this study, we developed dreamBase (http://rna.sysu.edu.cn/dreamBase) to facilitate the investigation of DNA modification, RNA regulation and protein binding of potential expressed pseudogenes from multidimensional high-throughput sequencing data. Based on ∼5500 ChIP-seq and DNase-seq datasets, we identified genome-wide binding profiles of various transcription-associated factors around pseudogene loci. By integrating ∼18 000 RNA-seq data, we analysed the expression profiles of pseudogenes and explored their co-expression patterns with their parent genes in 32 cancers and 31 normal tissues. By combining microRNA binding sites, we demonstrated complex post-transcriptional regulation networks involving 275 microRNAs and 1201 pseudogenes. We generated ceRNA networks to illustrate the crosstalk between pseudogenes and their parent genes through competitive binding of microRNAs. In addition, we studied transcriptome-wide interactions between RNA binding proteins (RBPs) and pseudogenes based on 458 CLIP-seq datasets. In conjunction with epitranscriptome sequencing data, we also mapped 1039 RNA modification sites onto 635 pseudogenes. This database will provide insights into the transcriptional regulation, expression, functions and mechanisms of pseudogenes as well as their roles in biological processes and diseases.

Journal ArticleDOI
TL;DR: The first comparative genomics study for L. helveticus identified a core genome of 988 genes and sets of unique, strain-specific genes ranging from about 30 to more than 200 genes, and found that a large number of pseudogenes were enriched for functional Gene Ontology categories such as amino acid transmembrane transport and carbohydrate metabolism, which is in line with a reductive genome evolution in the rich natural habitat of L. helveticus.
Abstract: Although complete genome sequences hold particular value for an accurate description of core genomes, the identification of strain-specific genes, and as the optimal basis for functional genomics studies, they are still largely underrepresented in public repositories. Based on an assessment of the genome assembly complexity for all lactobacilli, we used Pacific Biosciences’ long read technology to sequence and de novo assemble the genomes of three Lactobacillus helveticus starter strains, raising the number of completely sequenced strains to 12. The first comparative genomics study for L. helveticus - to our knowledge - identified a core genome of 988 genes and sets of unique, strain-specific genes ranging from about 30 to more than 200 genes. Importantly, the comparison of MiSeq- and PacBio-based assemblies uncovered that not only accessory but also core genes can be missed in incomplete genome assemblies based on short reads. Analysis of the three genomes revealed that a large number of pseudogenes were enriched for functional Gene Ontology categories such as amino acid transmembrane transport and carbohydrate metabolism, which is in line with a reductive genome evolution in the rich natural habitat of L. helveticus. Notably, the functional Clusters of Orthologous Groups of proteins categories “cell wall/membrane biogenesis” and “defense mechanisms” were found to be enriched among the strain-specific genes. A genome mining effort uncovered examples where an experimentally observed phenotype could be linked to the underlying genotype, such as for cell envelope proteinase PrtH3 of strain FAM8627. Another possible link identified for peptidoglycan hydrolases will require further experiments. Of note, strain FAM22155 did not harbor a CRISPR/Cas system; its loss was also observed in other L. helveticus strains and lactobacillus species, thus questioning the value of the CRISPR/Cas system for diagnostic purposes. 
Importantly, the complete genome sequences proved to be very useful for the analysis of natural whey starter cultures with metagenomics, as a larger percentage of the sequenced reads of these complex mixtures could be unambiguously assigned down to the strain level.

Journal ArticleDOI
TL;DR: In the review, classification of pseudogenes, methods of their detection in the genome, and the problem of their evolutionary conservatism and prevalence among species belonging to different taxonomic groups in the light of modern data are addressed.
Abstract: A pseudogene is a gene copy that has lost its original function. For a long time, pseudogenes were considered “junk DNA” that inevitably arises as a result of the ongoing evolutionary process. However, experimental data obtained in recent years indicate that this understanding of the nature of pseudogenes is not entirely correct, and that many pseudogenes perform important genetic functions. In this review, we address the classification of pseudogenes, methods for their detection in the genome, and the problem of their evolutionary conservation and prevalence among species belonging to different taxonomic groups in the light of modern data. The mechanisms of gene expression regulation by pseudogenes and the role of pseudogenes in the pathogenesis of various human diseases are discussed.

Journal ArticleDOI
TL;DR: The phylogenetic inference based on 63 plastid protein-coding genes of 38 taxa supports three major clades within the order Malpighiales and places flax (Linaceae) as sister to the family Chrysobalanaceae, differing from earlier studies that included Linaceae in the euphorbioid clade.
Abstract: The plastome of Linum usitatissimum was completely sequenced, allowing analyses of the evolution of genome structure, RNA editing sites and molecular markers, and indicating the position of Linaceae within Malpighiales. Flax (Linum usitatissimum L.) is an economically important crop used as food, feed, and industrial feedstock. It belongs to the Linaceae family, which is noted for high morphological and ecological diversity. Here, we report the complete sequence of the flax plastome, the first species within the Linaceae family to have its plastome sequenced, assembled and characterized in detail. The plastome of flax is a circular DNA molecule of 156,721 bp with a typical quadripartite structure including two IRs of 31,990 bp separating the LSC of 81,767 bp and the SSC of 10,974 bp. It shows two expansion events, from IRB to LSC and from IRB to SSC, and a contraction event at the IRA-LSC junction, which significantly changed the size and gene content of the LSC, SSC and IRs. We identified 109 unique genes and 2 pseudogenes (rpl23 and ndhF). The plastome has lost the conserved introns of the clpP gene and the complete sequence of the rps16 gene. The clpP, ycf1, and ycf2 genes show high nucleotide and amino acid divergence, but they possibly still retain functionality. Moreover, we identified 176 SSRs, 20 tandem repeats, and 39 dispersed repeats. We predicted a total of 53 RNA editing sites in 18 genes, of which 32 had not been found before in other species. The phylogenetic inference based on 63 plastid protein-coding genes of 38 taxa supports three major clades within the order Malpighiales. One of these clades has flax (Linaceae) as sister to the family Chrysobalanaceae, differing from earlier studies that included Linaceae in the euphorbioid clade.
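The quadripartite structure described in the flax abstract implies a simple length identity: the plastome length should equal the LSC plus the SSC plus two copies of the inverted repeat. The sketch below is only an illustrative consistency check on the reported figures, not part of the authors' analysis; the function name `plastome_length` is hypothetical.

```python
def plastome_length(lsc: int, ssc: int, ir: int) -> int:
    """Length of a quadripartite plastome: LSC + SSC + two IR copies (all in bp)."""
    return lsc + ssc + 2 * ir


# Region sizes (bp) as reported in the flax (Linum usitatissimum) abstract.
flax_total = plastome_length(lsc=81_767, ssc=10_974, ir=31_990)
print(flax_total)  # 156721, matching the reported plastome length of 156,721 bp
```

The same identity holds for the bittersweet figures quoted earlier in this listing (85,901 + 18,449 + 2 × 25,615 = 155,580 bp).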

Journal ArticleDOI
TL;DR: A large-scale genome-wide analysis of the IR gene repertoire in Lepidoptera identifies potential IR candidates for olfactory, gustatory and oviposition behaviors in the cotton bollworm and suggests that some A-IRs in H. armigera likely bear a dual function with their involvement in olfaction and gustation.

Journal ArticleDOI
TL;DR: It is found that genes in highly variable families have high turnover rates and tend to be involved in processes that have diverged between Solanaceae species, whereas genes in low-variability families tend to have housekeeping roles, and duplication mechanism impacts gene family turnover.
Abstract: Gene duplication and loss contribute to gene content differences as well as phenotypic divergence across species. However, the extent to which gene content varies among closely related plant species and the factors responsible for such variation remain unclear. Here, using the Solanaceae family as a model and Pfam domain families as a proxy for gene families, we investigated variation in gene family sizes across species and the likely factors contributing to the variation. We found that genes in highly variable families have high turnover rates and tend to be involved in processes that have diverged between Solanaceae species, whereas genes in low-variability families tend to have housekeeping roles. In addition, genes in high- and low-variability gene families tend to be duplicated by tandem and whole genome duplication, respectively. This finding together with the observation that genes duplicated by different mechanisms experience different selection pressures suggest that duplication mechanism impacts gene family turnover. We explored using pseudogene number as a proxy for gene loss but discovered that a substantial number of pseudogenes are actually products of pseudogene duplication, contrary to the expectation that most plant pseudogenes are remnants of once-functional duplicates. Our findings reveal complex relationships between variation in gene family size, gene functions, duplication mechanism, and evolutionary rate. The patterns of lineage-specific gene family expansion within the Solanaceae provide the foundation for a better understanding of the genetic basis underlying phenotypic diversity in this economically important family.

Journal ArticleDOI
TL;DR: This first report of paraphyly of the Dactylogyridea highlights the need to generate more molecular data for monogenean parasites, in order to be able to clarify their relationships using large datasets, as single-gene markers appear to provide a phylogenetic resolution which is too low for the task.
Abstract: Recent mitochondrial phylogenomics studies have reported a sister-group relationship of the orders Capsalidea and Dactylogyridea, which is inconsistent with previous morphology- and molecular-based phylogenies. As Dactylogyridea mitochondrial genomes (mitogenomes) are currently represented by only one family, to improve the phylogenetic resolution, we sequenced and characterized two dactylogyridean parasites, Lamellodiscus spari and Lepidotrema longipenis, belonging to a non-represented family Diplectanidae. The L. longipenis mitogenome (15,433 bp) contains the standard 36 flatworm mitochondrial genes (atp8 is absent), whereas we failed to detect trnS1, trnC and trnG in L. spari (14,614 bp). Both mitogenomes exhibit unique gene orders (among the Monogenea), with a number of tRNA rearrangements. Both long non-coding regions contain a number of different (partially overlapping) repeat sequences. Intriguingly, these include putative tRNA pseudogenes in a tandem array (17 trnV pseudogenes in L. longipenis, 13 trnY pseudogenes in L. spari). Combined nucleotide diversity, non-synonymous/synonymous substitutions ratio and average sequence identity analyses consistently showed that nad2, nad5 and nad4 were the most variable PCGs, whereas cox1, cox2 and cytb were the most conserved. Phylogenomic analysis showed that the newly sequenced species of the family Diplectanidae formed a sister-group with the Dactylogyridae + Capsalidae clade. Thus Dactylogyridea (represented by the Diplectanidae and Dactylogyridae) was rendered paraphyletic (with high statistical support) by the nested Capsalidea (represented by the Capsalidae) clade. Our results show that nad2, nad5 and nad4 (fast-evolving) would be better candidates than cox1 (slow-evolving) for species identification and population genetics studies in the Diplectanidae. The unique gene order pattern further suggests discontinuous evolution of mitogenomic gene order arrangement in the Class Monogenea. 
This first report of paraphyly of the Dactylogyridea highlights the need to generate more molecular data for monogenean parasites, in order to be able to clarify their relationships using large datasets, as single-gene markers appear to provide a phylogenetic resolution which is too low for the task.

Journal ArticleDOI
Hai-Ying Liu, Yan Yu, Yi-Qi Deng, Juan Li, Zi-Xuan Huang, Song-Dong Zhou
TL;DR: Providing L. henrici genomic resources and a comparative analysis of Lilium chloroplast genomes will benefit evolutionary studies and phylogenetic reconstruction of the genus Lilium, as well as molecular barcoding in population genetics.
Abstract: Lilium henrici Franchet, which belongs to the family Liliaceae, is an endangered plant native to China. The wild populations of L. henrici have been largely reduced by habitat degradation or loss. In our study, we determined the whole chloroplast genome sequence of L. henrici and compared its structure with those of other Lilium (including Nomocharis) species. The chloroplast genome of L. henrici is circular and 152,784 bp in length. The large single-copy and small single-copy regions are 82,429 bp and 17,533 bp in size, respectively, and the inverted repeats are 26,411 bp in size. The L. henrici chloroplast genome contains 116 different genes, including 78 protein-coding genes, 30 tRNA genes, 4 rRNA genes, and 4 pseudogenes. There were 51 SSRs detected in the L. henrici chloroplast genome sequence. Comparison of the genes of L. henrici with those of other Lilium (including Nomocharis) chloroplast genomes shows that sequence lengths and gene contents vary little, the only differences being in three pseudogenes. Phylogenetic analysis revealed that N. pardanthina was a sister species to L. henrici. Overall, by providing L. henrici genomic resources and a comparative analysis of Lilium chloroplast genomes, this study will benefit evolutionary studies and phylogenetic reconstruction of the genus Lilium, as well as molecular barcoding in population genetics.

Journal ArticleDOI
TL;DR: The first high-resolution map of 5S and 45S rDNA array contacts with the rest of the genome was built by combining over 15 billion Hi-C reads from several experiments, identifying functional categories whose dispersed genes coalesced in proximity to the rDNA arrays or instead avoided proximity with them.
Abstract: The repeated rDNA array gives rise to the nucleolus, an organelle that is central to cellular processes as varied as stress response, cell cycle regulation, RNA modification, cell metabolism, and genome stability. The rDNA array is also responsible for the production of more than 70% of all cellular RNAs (the ribosomal RNAs). The rRNAs are produced from two sets of loci: the 5S rDNA array resides exclusively on human chromosome 1 while the 45S rDNA arrays reside on the short arm of five human acrocentric chromosomes. These critical genome elements have remained unassembled and have been excluded from all Hi-C analyses to date. Here we built the first high resolution map of 5S and 45S rDNA array contacts with the rest of the genome combining over 15 billion Hi-C reads from several experiments. The data enabled sufficiently high coverage to map rDNA-genome interactions with 1 Mb resolution and identify rDNA-gene contacts. The map showed that the 5S and 45S arrays display preferential contact at common sites along the genome but are not themselves sufficiently close to yield 5S-45S Hi-C contacts. Ribosomal DNA contacts are enriched in segments of closed, repressed, and late replicating chromatin, as well as CTCF binding sites. Finally, we identified functional categories whose dispersed genes coalesced in proximity to the rDNA arrays or instead avoided proximity with the rDNA arrays. The observations further our understanding of the spatial localization of rDNA arrays and their contribution to the architecture of the cell nucleus.

Journal ArticleDOI
Sui Wang, Chuanping Yang, Xiyang Zhao, Su Chen, Guan-Zheng Qu
TL;DR: The phylogenetic analysis suggested that B. platyphylla had a closer evolutionary relationship with B. pendula than B. nana and the sequence of the Fagales species cp genome was relatively conserved, but there were still some high variation regions that could be used as molecular markers.
Abstract: Betula platyphylla is a common tree species in northern China that has high economic and medicinal value. Our laboratory has been devoted to genome research on B. platyphylla for approximately 10 years. As primary organelle genomes, the complete genome sequences of chloroplasts are important to study the divergence of species, RNA editing and phylogeny. In this study, we sequenced and analyzed the complete chloroplast (cp) genome sequence of B. platyphylla. The complete cp genome of B. platyphylla was 160,518 bp in length, which included a pair of inverted repeats (IRs) of 26,056 bp that separated a large single copy (LSC) region of 89,397 bp and a small single copy (SSC) region of 19,009 bp. The annotation contained a total of 129 genes, including 84 protein-coding genes, 37 tRNA genes and 8 rRNA genes. There were 3 genes using alternative initiation codons. Comparative genomics showed that the sequence of the Fagales species cp genome was relatively conserved, but there were still some high variation regions that could be used as molecular markers. The IR expansion event of B. platyphylla resulted in larger cp genomes and rps19 pseudogene formation. The simple sequence repeat (SSR) analysis showed that there were 105 SSRs in the cp genome of B. platyphylla. RNA editing sites recognition indicated that at least 80 RNA editing events occurred in the cp genome. Most of the substitutions were C to U, while a small proportion of them were not. In particular, three editing loci on the rRNA were converted to more than two other bases that had never been reported. For synonymous conversion, most of them increased the relative synonymous codon usage (RSCU) value of the codons. The phylogenetic analysis suggested that B. platyphylla had a closer evolutionary relationship with B. pendula than B. nana. In this study, we not only obtained and annotated the complete cp genome sequence of B. platyphylla, but we also identified new RNA editing sites and predicted the phylogenetic relationships among Fagales species. These findings will facilitate genomic, genetic engineering and phylogenetic studies of this important species.

Journal ArticleDOI
TL;DR: Signatures of LncRNAs and pseudogenes can predict overall survival and recurrence of RCC, and six of these, including LINC00520, PIK3CD-AS1, LINC01559, CEACAM22P, MSL3P1 and TREML3P, could be non-invasive biomarkers of RCC.
Abstract: Increasing evidence suggests a critical role for long noncoding RNAs (LncRNAs) and pseudogenes in cancer. Renal cell carcinoma (RCC), the most common primary renal neoplasm, is highly aggressive and difficult to treat because of its resistance to chemotherapy and radiotherapy. Despite many identified LncRNAs and pseudogenes, few have been clearly elucidated. This study provides new insights into LncRNAs and pseudogenes in the prognosis of RCC. We searched an online database to interrogate alterations and clinical data on cBioPortal. We analysed LncRNA and pseudogene signatures to predict the prognosis of RCC based on a Cox model. We also found potential serum biomarkers of RCC and validated them in 32 RCC patients, as well as healthy controls. Alterations were found in 2553 LncRNAs and 8901 pseudogenes and occurred in up to 23% of all cases. Among these, 27 LncRNAs and 45 pseudogenes were closely related to prognosis. We also identified signatures of LncRNAs and pseudogenes that can predict overall survival and recurrence of RCC. We then validated the relative levels of these LncRNAs and pseudogenes in the serum of 32 patients. Six of these, including LINC00520, PIK3CD-AS1, LINC01559, CEACAM22P, MSL3P1 and TREML3P, could be non-invasive biomarkers of RCC. Finally, we selected PIK3CD-AS1 to determine its role in RCC and found that upregulation of PIK3CD-AS1 was closely associated with higher tumour stage and metastasis. These signatures of LncRNAs and pseudogenes can predict overall survival and recurrence of RCC. LINC00520, PIK3CD-AS1, LINC01559, CEACAM22P, MSL3P1 and TREML3P could be non-invasive biomarkers of RCC. These data suggest important roles for LncRNAs and pseudogenes in RCC and therefore provide new insights into the prognosis of RCC.

Journal ArticleDOI
TL;DR: It is found that the pseudo-3′UTR of BRAFP1, a previously described oncogenic ceRNA, has reduced substitutions relative to its pseudo-coding sequence, and sequence constraint on MREs shared between the parent gene, BRAF, and the pseudogene is shown.
Abstract: The competitive endogenous RNA (ceRNA) hypothesis is an attractively simple model to explain the biological role of many putatively functionless noncoding RNAs. Under this model, there exist transcripts in the cell whose role is to titrate out microRNAs such that the expression level of another target sequence is altered. That it is logistically possible for expression of one microRNA recognition element (MRE)-containing transcript to affect another is seen in the multiple examples of pathogenic effects of inappropriate expression of MRE-containing RNAs. However, the role, if any, of ceRNAs in normal biological processes and at physiological levels is disputed. By comparison of parent genes and pseudogenes we show, both for a specific example and genome-wide, that the pseudo-3' untranslated regions (3'UTRs) of expressed pseudogenes are frequently retained and are under selective constraint in mammalian genomes. We found that the pseudo-3'UTR of BRAFP1, a previously described oncogenic ceRNA, has reduced substitutions relative to its pseudo-coding sequence, and we show sequence constraint on MREs shared between the parent gene, BRAF, and the pseudogene. Investigation of RNA-seq data reveals expression of BRAFP1 in normal somatic tissues in human and in other primates, consistent with biological ceRNA functionality of this pseudogene in nonpathogenic cellular contexts. Furthermore, we find that on a genome-wide scale pseudo-3'UTRs of mammalian pseudogenes (n = 1,629) are under stronger selective constraint than their pseudo-coding sequence counterparts, and are more often retained and expressed. Our results suggest that many human pseudogenes, often considered nonfunctional, may have an evolutionarily constrained role, consistent with the ceRNA hypothesis.

Journal ArticleDOI
TL;DR: The global apricot transcriptome response to PPV-D infection is characterized, identifying six PPVres locus genes (ParP-1 to ParP-6) differentially expressed in resistant/susceptible cultivars and hypothesizing that ParPMC2 is a pseudogene that mediates down-regulation of its functional paralog ParPMC1 by silencing.
Abstract: Plum pox virus (PPV), causing Sharka disease, is one of the main limiting factors for Prunus production worldwide. In apricot (Prunus armeniaca L.) the major PPV resistance locus (PPVres), comprising ~ 196 kb, has been mapped to the upper part of linkage group 1. Within the PPVres, 68 genomic variants linked in coupling to PPV resistance were identified within 23 predicted transcripts according to peach genome annotation. Taking into account the predicted functions inferred from sequence homology, some members of a cluster of meprin and TRAF-C homology domain (MATHd)-containing genes were pointed as PPV resistance candidate genes. Here, we have characterized the global apricot transcriptome response to PPV-D infection identifying six PPVres locus genes (ParP-1 to ParP-6) differentially expressed in resistant/susceptible cultivars. Two of them (ParP-3 and ParP-4), that encode MATHd proteins, appear clearly down-regulated in resistant cultivars, as confirmed by qRT-PCR. Concurrently, variant calling was performed using whole-genome sequencing data of 24 apricot cultivars (10 PPV-resistant and 14 PPV-susceptible) and 2 wild relatives (PPV-susceptible). ParP-3 and ParP-4, named as Prunus armeniaca PPVres MATHd-containing genes (ParPMC), are the only 2 genes having allelic variants linked in coupling to PPV resistance. ParPMC1 has 1 nsSNP, while ParPMC2 has 15 variants, including a 5-bp deletion within the second exon that produces a frameshift mutation. ParPMC1 and ParPMC2 are adjacent and highly homologous (87.5% identity) suggesting they are paralogs originated from a tandem duplication. Cultivars carrying the ParPMC2 resistant (mutated) allele show lack of expression in both ParPMC2 and especially ParPMC1. Accordingly, we hypothesize that ParPMC2 is a pseudogene that mediates down-regulation of its functional paralog ParPMC1 by silencing. 
As a whole, the results strongly support ParPMC1 and/or ParPMC2 as host susceptibility genes required for PPV infection, whose silencing may confer the PPV resistance trait. This finding may facilitate resistance breeding by marker-assisted selection and pave the way for gene editing approaches in Prunus.

Journal ArticleDOI
TL;DR: An association between expression of OCT4 and its pseudogenes and cancer prognosis was established, which may serve as a therapeutic target for various human cancers.
Abstract: OCT4 is a master transcription factor that regulates the pluripotency of pluripotent stem cells and cancer stem cells along with other factors, including SOX2, KLF4, and C-MYC. Three different transcripts, OCT4A, OCT4B, and OCT4B1, are known to be generated by alternative splicing and eight OCT4 pseudogenes have been found in the human genome. Among them, we examined OCT4 and three pseudogenes (POU5F1P1, POU5F1P3, and POU5F1P4) because of their likely high expression in cancer. In addition, previous studies indicated that OCT4 expression is augmented in cervical cancer and associated with poor prognosis, whereas OCT4 is down-regulated and correlated with good clinical outcomes in breast cancer. Because of these conflicting reports, we systematically evaluated whether expression of OCT4 and its pseudogenes can serve as oncogenic markers in various human cancers using the Oncomine database. Moreover, copy number alterations and mutations in the OCT4 gene and its pseudogenes were analyzed using cBioPortal, and the relationship between expression of OCT4 and its pseudogenes and the survival probability of cancer patients was explored using the Kaplan-Meier plotter, OncoLnc, PROGgeneV2, and PrognoScan databases. Multivariate survival analysis was further conducted to determine the risk associated with expression of OCT4 and its pseudogenes in certain cancer types using data from the Kaplan-Meier plotter. Overall, an association between expression of OCT4 and its pseudogenes and cancer prognosis was established, which may serve as a therapeutic target for various human cancers.

Journal ArticleDOI
TL;DR: Overall, the analysis provides a detailed picture of ERV-W sequence colonization of primate lineage genomes as well as novel insights into the evolution and origin of the group.
Abstract: The genomes of all vertebrates harbor remnants of ancient retroviral infections, having affected the germ line cells during the last 100 million years. These sequences, named Endogenous Retroviruses (ERVs), have been transmitted to the offspring in a Mendelian way, being relatively stable components of the host genome even long after their exogenous counterparts went extinct. Among human ERVs (HERVs), the HERV-W group is of particular interest for our physiology and pathology. A HERV-W provirus in locus 7q21.2 has been coopted during evolution to exert an essential role in placenta, and the group expression has been tentatively linked to Multiple Sclerosis and other diseases. Following up on a detailed analysis of 213 HERV-W insertions in the human genome, we now investigated the ERV-W group genomic spread within primate lineages. We analyzed HERV-W orthologous loci in the genome sequences of 12 non-human primate species belonging to Simiiformes (parvorders Catarrhini and Platyrrhini), Tarsiiformes and to the most primitive Prosimians. Analysis of HERV-W orthologous loci in non-human Catarrhini primates revealed species-specific insertions in the genomes of Chimpanzee (3), Gorilla (4), Orangutan (6), Gibbon (2) and especially Rhesus Macaque (66). Such sequences were acquired in a retroviral fashion and, in the majority of cases, by L1-mediated formation of processed pseudogenes. There were also a number of LTR-LTR homologous recombination events that occurred subsequent to separation of Catarrhini sub-lineages. Moreover, we retrieved 130 sequences in Marmoset and Squirrel Monkeys (family Cebidae, Platyrrhini parvorder), identified as ERV1–1_CJa based on RepBase annotations, which appear closely related to the ERV-W group. Such sequences were also identified in Atelidae and Pitheciidae, representative of the other Platyrrhini families. In contrast, no ERV-W-related sequences were found in genome sequence assemblies of Tarsiiformes and Prosimians. 
Overall, our analysis now provides a detailed picture of ERV-W sequence colonization of primate lineage genomes, revealing the exact dynamics of ERV-W locus formation as well as novel insights into the evolution and origin of the group.

Journal ArticleDOI
12 Mar 2018-PLOS ONE
TL;DR: A Maximum Likelihood phylogenetic tree based on the complete chloroplast genomes of 38 species from 13 families strongly indicated that E. aureum is positioned as the sister of Colocasia esculenta within the Araceae family.
Abstract: Epipremnum aureum is an important foliage plant in the Araceae family. In this study, we have sequenced the complete chloroplast genome of E. aureum by using Illumina Hiseq sequencing platforms. This genome is a double-stranded circular DNA sequence of 164,831 bp that contains 35.8% GC. The two inverted repeats (IRa and IRb; 26,606 bp) are spaced by a small single-copy region (22,868 bp) and a large single-copy region (88,751 bp). The chloroplast genome has 131 (113 unique) functional genes, including 86 (79 unique) protein-coding genes, 37 (30 unique) tRNA genes, and eight (four unique) rRNA genes. Tandem repeats comprise the majority of the 43 long repetitive sequences. In addition, 111 simple sequence repeats are present, with mononucleotides being the most common type and di- and tetranucleotides being infrequent events. Positive selection pressure on rps12 in the E. aureum chloroplast has been demonstrated via synonymous and nonsynonymous substitution rates and selection pressure sites analyses. Ycf15 and infA are pseudogenes in this species. We constructed a Maximum Likelihood phylogenetic tree based on the complete chloroplast genomes of 38 species from 13 families. Those results strongly indicated that E. aureum is positioned as the sister of Colocasia esculenta within the Araceae family. This work may provide information for further study of the molecular phylogenetic relationships within Araceae, as well as molecular markers and breeding novel varieties by chloroplast genetic-transformation of E. aureum in particular.
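
The gene counts in this abstract are given as total copies with unique genes in parentheses, so the per-category figures can be tallied against the stated totals. The sketch below only re-adds the numbers quoted above. The dictionary layout and variable names are illustrative assumptions, not part of the study.

```python
# (total copies, unique genes) per category, as quoted in the abstract above.
reported = {
    "protein-coding": (86, 79),
    "tRNA": (37, 30),
    "rRNA": (8, 4),
}
REPORTED_TOTAL = 131    # functional genes, all copies counted
REPORTED_UNIQUE = 113   # unique functional genes

total = sum(copies for copies, _ in reported.values())
unique = sum(uniq for _, uniq in reported.values())
extra_copies = total - unique  # copies beyond the first for multi-copy genes

print(f"total: {total} (matches: {total == REPORTED_TOTAL})")
print(f"unique: {unique} (matches: {unique == REPORTED_UNIQUE})")
print(f"extra copies: {extra_copies}")
```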

Journal ArticleDOI
TL;DR: In this article, the metagenome of a Coxiella-like endosymbiont (CLE) population from wild Rhipicephalus sanguineus ticks (CRs) was sequenced and compared to the previously published genome of its close relative, CLE of R. turanicus (CRt).
Abstract: Understanding the symbiotic interaction between Coxiella-like endosymbionts (CLE) and their tick hosts is challenging due to the lack of isolates and difficulties in tick functional assays. Here we sequenced the metagenome of a CLE population from wild Rhipicephalus sanguineus ticks (CRs) and compared it to the previously published genome of its close relative, CLE of R. turanicus (CRt). The tick hosts are closely related sympatric species, and their two endosymbiont genomes are highly similar with only minor differences in gene content. Both genomes encode numerous pseudogenes, consistent with an ongoing genome reduction process. In silico flux balance metabolic analysis (FBA) revealed the excess production of L-proline for both genomes, indicating a possible proline transport from Coxiella to the tick. Additionally, both CR genomes encode multiple copies of the proline/betaine transporter gene, proP. Modelling additional Coxiellaceae members, including other tick CLE, did not identify proline as an excreted metabolite. Although both CRs and CRt genomes encode intact B vitamin synthesis pathway genes, which are presumed to underlie the mechanism of CLE-tick symbiosis, the FBA analysis indicated no changes for their products. Therefore, this study provides new testable hypotheses for the symbiosis mechanism and a better understanding of CLE genome evolution and diversity.

Journal ArticleDOI
TL;DR: Age‐related differences in unique RNA transcripts were further validated in an expanded cohort and protein interaction networks revealed distinct clusters of functionally‐related protein‐coding genes in both age groups, providing timely and relevant insight into the exRNA repertoire in serum and its change with aging.
Abstract: Circulating extracellular RNAs (exRNAs) are potential biomarkers of disease. We thus hypothesized that age-related changes in exRNAs can identify age-related processes. We profiled both large and small RNAs in human serum to investigate changes associated with normal aging. exRNA was sequenced in 13 young (30-32 years) and 10 old (80-85 years) African American women to identify all RNA transcripts present in serum. We identified age-related differences in several RNA biotypes, including mitochondrial transfer RNAs, mitochondrial ribosomal RNA, and unprocessed pseudogenes. Age-related differences in unique RNA transcripts were further validated in an expanded cohort. Pathway analysis revealed that EIF2 signaling, oxidative phosphorylation, and mitochondrial dysfunction were among the top pathways shared between young and old. Protein interaction networks revealed distinct clusters of functionally-related protein-coding genes in both age groups. These data provide timely and relevant insight into the exRNA repertoire in serum and its change with aging.

Journal ArticleDOI
TL;DR: The data presented suggest that intraspecies variations at the chromosome and gene level are more likely to influence differences in tropism as well as response to treatment, and contribute to a greater understanding of the parasite molecular mechanisms underpinning these differences.
Abstract: Leishmaniasis is a neglected tropical disease with diverse clinical phenotypes, determined by parasite, host and vector interactions. Despite the advances in molecular biology and the availability of more Leishmania genome references in recent years, the association between parasite species and distinct clinical phenotypes remains poorly understood. We present a genomic comparison of an atypical variant of Leishmania donovani from a South Asian focus, where it mostly causes the cutaneous form of leishmaniasis. Clinical isolates from six cutaneous leishmaniasis patients (CL-SL), 2 of whom were poor responders to antimony (CL-PR), and two visceral leishmaniasis patients (VL-SL) were sequenced on an Illumina MiSeq platform. Chromosome aneuploidy was observed in both groups but was more frequent in CL-SL. 248 genes differed by 2-fold or more in copy number between the two groups. Genes involved in amino acid use (LdBPK_271940) and energy metabolism (LdBPK_271950) predominated in the VL-SL group, with the same distribution pattern reflected in gene tandem arrays. Genes encoding amastins were present in higher copy numbers in VL-SL and CL-PR as well as being among predicted pseudogenes in CL-SL. Both chromosome and SNP profiles showed CL-SL and VL-SL to form two distinct groups. While expected heterozygosity was much higher in VL-SL, SNP allele frequency patterns did not suggest potential recent recombination breakpoints. The SNP/indel profile obtained using the more recently generated PacBio sequence did not vary markedly from that based on the standard LdBPK282A1 reference. Several genes previously associated with resistance to antimonials were observed in higher copy numbers in the analysis of CL-PR. H-locus amplification was seen in one cutaneous isolate which, however, did not belong to the CL-PR group.
The data presented suggest that intraspecies variations at the chromosome and gene level are more likely to influence differences in tropism as well as response to treatment, and contribute to a greater understanding of the parasite molecular mechanisms underpinning these differences. These findings should be substantiated with a larger sample number and expression/functional studies.

Journal ArticleDOI
TL;DR: The study highlights the duplication and sub-functionalization of the MIP family, its strong coordinated expression with genes involved in growth and transport, and the putative classes of TFs responsible for its regulation.
Abstract: The major intrinsic protein (MIP) family is a family of proteins, including aquaporins, which facilitate water and small molecule transport across plasma membranes. In plants, MIPs function in a huge variety of processes including water transport, growth, stress response, and fruit development. In this study, we characterize the structure and transcriptional regulation of the MIP family in grapevine, describing the putative genome duplication events leading to the family structure and characterizing the family’s tissue and developmental specific expression patterns across numerous preexisting microarray and RNAseq datasets. Gene co-expression network (GCN) analyses were carried out across these datasets and the promoters of each family member were analyzed for cis-regulatory element structure in order to provide insight into their transcriptional regulation. A total of 29 Vitis vinifera MIP family members (excluding putative pseudogenes) were identified of which all but two were mapped onto Vitis vinifera chromosomes. In this study, segmental duplication events were identified for five plasma membrane intrinsic protein (PIP) and four tonoplast intrinsic protein (TIP) genes, contributing to the expansion of PIPs and TIPs in grapevine. Grapevine MIP family members have distinct tissue and developmental expression patterns and hierarchical clustering revealed two primary groups regardless of the datasets analyzed. Composite microarray and RNA-seq gene co-expression networks (GCNs) highlighted the relationships between MIP genes and functional categories involved in cell wall modification and transport, as well as with other MIPs revealing a strong co-regulation within the family itself. Some duplicated MIP family members have undergone sub-functionalization and exhibit distinct expression patterns and GCNs. 
Cis-regulatory element (CRE) analyses of the MIP promoters and their associated GCN members revealed enrichment for numerous CREs including AP2/ERFs and NACs. Combining phylogenetic analyses, gene expression profiling, gene co-expression network analyses, and cis-regulatory element enrichment, this study provides a comprehensive overview of the structure and transcriptional regulation of the grapevine MIP family. The study highlights the duplication and sub-functionalization of the family, its strong coordinated expression with genes involved in growth and transport, and the putative classes of TFs responsible for its regulation.