
Showing papers on "Pseudogene" published in 2021


Journal Article
TL;DR: In this article, the authors identified homologs of the genes underlying this phenotype, cifA and cifB, in 52 of 71 new and published Wolbachia genome sequences.
Abstract: Cytoplasmic incompatibility is a selfish reproductive manipulation induced by the endosymbiont Wolbachia in arthropods. In males Wolbachia modifies sperm, leading to embryonic mortality in crosses with Wolbachia-free females. In females, Wolbachia rescues the cross and allows development to proceed normally. This provides a reproductive advantage to infected females, allowing the maternally transmitted symbiont to spread rapidly through host populations. We identified homologs of the genes underlying this phenotype, cifA and cifB, in 52 of 71 new and published Wolbachia genome sequences. They are strongly associated with cytoplasmic incompatibility. There are up to seven copies of the genes in each genome, and phylogenetic analysis shows that Wolbachia frequently acquires new copies due to pervasive horizontal transfer between strains. In many cases, the genes have subsequently acquired loss-of-function mutations to become pseudogenes. As predicted by theory, this tends to occur first in cifB, whose sole function is to modify sperm, and then in cifA, which is required to rescue the cross in females. Although cif genes recombine, recombination is largely restricted to closely related homologs. This is predicted under a model of coevolution between sperm modification and embryonic rescue, where recombination between distantly related pairs of genes would create a self-incompatible strain. Together, these patterns of gene gain, loss, and recombination support evolutionary models of cytoplasmic incompatibility.

62 citations


Journal Article
TL;DR: This review summarizes how PIWI proteins and piRNAs regulate the expression of many disparate RNAs, describing a highly complex global genomic regulatory relationship at the RNA level through which piRNAs functionally connect all major constituents of the genome in the germline.
Abstract: PIWI proteins, a subfamily of PAZ/PIWI Domain family RNA-binding proteins, are best known for their function in silencing transposons and germline development by partnering with small noncoding RNAs called PIWI-interacting RNAs (piRNAs). However, recent studies have revealed multifaceted roles of the PIWI-piRNA pathway in regulating the expression of other major classes of RNAs in germ cells. In this review, we summarize how PIWI proteins and piRNAs regulate the expression of many disparate RNAs, describing a highly complex global genomic regulatory relationship at the RNA level through which piRNAs functionally connect all major constituents of the genome in the germline.

43 citations


Journal Article
TL;DR: The complexity of genomic alteration of GPCR genes as well as their functional consequences are highlighted and derived therapeutic approaches are discussed.
Abstract: There are approximately 800 annotated G protein-coupled receptor (GPCR) genes, making these membrane receptors members of the most abundant gene family in the human genome. Besides being involved in manifold physiologic functions and serving as important pharmacotherapeutic targets, mutations in 55 GPCR genes cause about 66 inherited monogenic diseases in humans. Alterations of nine GPCR genes are causatively involved in inherited digenic diseases. In addition to classic gain- and loss-of-function variants, other aspects, such as biased signaling, trans-signaling, ectopic expression, allele variants of GPCRs, pseudogenes, gene fusion, and gene dosage, contribute to the repertoire of GPCR dysfunctions. However, the spectrum of alterations and GPCR involvement is probably much larger because an additional 91 GPCR genes contain homozygous or hemizygous loss-of-function mutations in human individuals with currently unidentified phenotypes. This review highlights the complexity of genomic alteration of GPCR genes as well as their functional consequences and discusses derived therapeutic approaches. SIGNIFICANCE STATEMENT: With the advent of new transgenic and sequencing technologies, the number of monogenic diseases related to G protein-coupled receptor (GPCR) mutants has significantly increased, and our understanding of the functional impact of certain kinds of mutations has substantially improved. Besides the classical gain- and loss-of-function alterations, additional aspects, such as biased signaling, trans-signaling, ectopic expression, allele variants of GPCRs, uniparental disomy, pseudogenes, gene fusion, and gene dosage, need to be elaborated in light of GPCR dysfunctions and possible therapeutic strategies.

43 citations


Journal Article
TL;DR: The analysis of the genomes of two Cuban species belonging to the genus Lucifuga provided evidence for the largest loss of eye-specific genes and nonvisual opsin genes reported so far in cavefishes, suggesting that blind fishes cannot thrive more than a few million years in cave ecosystems.
Abstract: Evolution sometimes proceeds by loss, especially when structures and genes become dispensable after an environmental shift relaxes functional constraints. Subterranean vertebrates are outstanding models to analyze this process, and gene decay can serve as a readout. We sought to understand some general principles on the extent and tempo of the decay of genes involved in vision, circadian clock, and pigmentation in cavefishes. The analysis of the genomes of two Cuban species belonging to the genus Lucifuga provided evidence for the largest loss of eye-specific genes and nonvisual opsin genes reported so far in cavefishes. Comparisons with a recently evolved cave population of Astyanax mexicanus and three species belonging to the Chinese tetraploid genus Sinocyclocheilus revealed the combined effects of the level of eye regression, time, and genome ploidy on eye-specific gene pseudogenization. The limited extent of gene decay in all these cavefishes and the very small number of loss-of-function mutations per pseudogene suggest that their eye degeneration may not be very ancient, ranging from early to late Pleistocene. This is in sharp contrast with the identification of several vision genes carrying many loss-of-function mutations in ancient fossorial mammals, further suggesting that blind fishes cannot thrive more than a few million years in cave ecosystems.

32 citations


Journal Article
TL;DR: In this article, the authors performed whole-genome sequencing using long-read sequencing technology (Oxford Nanopore) for 11 Japanese liver cancers and matched normal samples which were previously sequenced for the International Cancer Genome Consortium (ICGC).
Abstract: Identification of germline variation and somatic mutations is a major issue in human genetics. However, due to the limitations of DNA sequencing technologies and computational algorithms, our understanding of genetic variation and somatic mutations is far from complete. In the present study, we performed whole-genome sequencing using long-read sequencing technology (Oxford Nanopore) for 11 Japanese liver cancers and matched normal samples which were previously sequenced for the International Cancer Genome Consortium (ICGC). We constructed an analysis pipeline for the long-read data and identified germline and somatic structural variations (SVs). In polymorphic germline SVs, our analysis identified 8004 insertions, 6389 deletions, 27 inversions, and 32 intra-chromosomal translocations. By comparing to the chimpanzee genome, we correctly inferred events that caused insertions and deletions and found that most insertions were caused by transposons and Alu is the most predominant source, while other types of insertions, such as tandem duplications and processed pseudogenes, are rare. We inferred mechanisms of deletion generation and found that most non-allelic homologous recombination (NAHR) events were caused by recombination errors in SINEs. Analysis of somatic mutations in liver cancers showed that long reads could detect larger numbers of SVs than a previous short-read study and that mechanisms of cancer SV generation were different from those of germline deletions. Our analysis provides a comprehensive catalog of polymorphic and somatic SVs, as well as their possible causes. Our software is available at https://github.com/afujimoto/CAMPHOR and https://github.com/afujimoto/CAMPHORsomatic .

27 citations


Journal Article
TL;DR: In this paper, the onion line DHCU066619 was assembled into 14.9 Gb with an N50 of 464 Kb, of which 2.4 Gb was ordered into eight pseudomolecules using four genetic linkage maps and the remainder of the genome is available in 89.6 K scaffolds.
Abstract: Onion is an important vegetable crop with an estimated genome size of 16 Gb. We describe the de novo assembly and ab initio annotation of the genome of a doubled haploid onion line DHCU066619, which resulted in a final assembly of 14.9 Gb with an N50 of 464 Kb. Of this, 2.4 Gb was ordered into eight pseudomolecules using four genetic linkage maps. The remainder of the genome is available in 89.6 K scaffolds. Only 72.4% of the genome could be identified as repetitive sequences and consist, to a large extent, of (retro) transposons. In addition, an estimated 20% of the putative (retro) transposons had accumulated a large number of mutations, hampering their identification, but facilitating their assembly. These elements are probably already quite old. The ab initio gene prediction indicated 540,925 putative gene models, which is far more than expected, possibly due to the presence of pseudogenes. Of these models, 47,066 showed RNASeq support. No gene rich regions were found, genes are uniformly distributed over the genome. Analysis of synteny with Allium sativum (garlic) showed collinearity but also major rearrangements between both species. This assembly is the first high-quality genome sequence available for the study of onion and will be a valuable resource for further research.

22 citations
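
Note on the N50 figure quoted in the onion assembly abstract above: N50 is a contiguity statistic, commonly defined as the length L such that sequences of length L or longer together cover at least half of the total assembly. The following is a minimal sketch of that common definition, not code from the paper; the function name `n50` and the toy scaffold lengths are hypothetical and are not onion data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Uses the common definition: sort lengths from longest to shortest and
    return the length at which the running total first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical): total is 300, so half is 150.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # 80 + 70 reaches 150, so this prints 70
```

Because the statistic depends only on the sorted lengths, a few very long scaffolds can raise N50 even when most of the assembly remains fragmented, which is why the abstract also reports the scaffold count.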


Posted Content
05 Mar 2021-bioRxiv
TL;DR: This assembly of the genome of a doubled haploid onion line DHCU066619, which resulted in a final assembly of 14.9 Gb, is the first high-quality genome sequence available for the study of onion and will be a valuable resource for further research.
Abstract: Onion is an important vegetable crop with an estimated genome size of 16Gb. We describe the de novo assembly and ab initio annotation of the genome of a doubled haploid onion line DHCU066619, which resulted in a final assembly of 14.9 Gb with a N50 of 461 Kb. Of this, 2.2 Gb was ordered into 8 pseudomolecules using five genetic linkage maps. The remainder of the genome is available in 89.8 K scaffolds. Only 72.4% of the genome could be identified as repetitive sequences and consist, to a large extent, of (retro) transposons. In addition, an estimated 20% of the putative (retro) transposons had accumulated a large number of mutations, hampering their identification, but facilitating their assembly. These elements are probably already quite old. The ab initio gene prediction indicated 540,925 putative gene models, which is far more than expected, possibly due to the presence of pseudogenes. Of these models, 86,073 showed similarity to published proteins (UNIPROT). No gene rich regions were found, genes are uniformly distributed over the genome. Analysis of synteny with A. sativum (garlic) showed collinearity but also major rearrangements between both species. This assembly is the first high-quality genome sequence available for the study of onion and will be a valuable resource for further research.

21 citations


Journal Article
TL;DR: In this article, the authors summarize the discovery of tRNA-derived RNA fragments (tRFs) and recent advances in deciphering their regulatory role in the pathophysiology of different human diseases.
Abstract: Hundreds of tRNA genes and pseudogenes are encoded by the human genome. tRNAs are the second most abundant type of RNA in the cell. Advances in deep-sequencing technologies have revealed abundant expression of functional tRNA-derived RNA fragments (tRFs). They are generated from either precursor (pre-)tRNA or mature tRNA. They have been found to play crucial regulatory roles during different pathological conditions. Herein, we briefly summarize the discovery and recent advances in deciphering the regulatory role played by tRFs in the pathophysiology of different human diseases.

21 citations


Journal Article
TL;DR: In this paper, the authors identify hundreds of novel transcribed pseudogenes expressed in tissue-specific patterns and assess the biological impact of non-coding pseudogene transcripts, using CRISPR-Cas9 deletion and observation of hundreds of perturbed genes.
Abstract: Pseudogenes are gene copies presumed to mainly be functionless relics of evolution due to acquired deleterious mutations or transcriptional silencing. Using deep full-length PacBio cDNA sequencing of normal human tissues and cancer cell lines, we identify here hundreds of novel transcribed pseudogenes expressed in tissue-specific patterns. Some pseudogene transcripts have intact open reading frames and are translated in cultured cells, representing unannotated protein-coding genes. To assess the biological impact of noncoding pseudogenes, we CRISPR-Cas9 delete the nucleus-enriched pseudogene PDCL3P4 and observe hundreds of perturbed genes. This study highlights pseudogenes as a complex and dynamic component of the human transcriptional landscape.

19 citations


Posted Content
10 Jul 2021-bioRxiv
TL;DR: The authors' assembly with fully resolved chromosomes provides evidence of an evolutionary path taken to create the Z and W sex chromosomes of schistosomes, and sex-linked divergence of the single U2AF gene may have been a pivotal step in the evolution of gonochorism and genotypic sex determination of schistosomes.
Abstract: Background: Schistosoma mansoni is a flatworm that causes a neglected tropical disease affecting millions worldwide. Most flatworms are hermaphrodites but schistosomes have genotypically determined male (ZZ) and female (ZW) sexes. Sex is essential for pathology and transmission; however, the molecular determinants of sex remain unknown, and their study is limited by poorly resolved sex chromosomes in previous genome assemblies. Results: We assembled the 391.4 Mb S. mansoni genome into individual, single-scaffold chromosomes, including Z and W. Manual curation resulted in a vastly improved gene annotation, resolved gene and repeat arrays, trans-splicing, and almost all UTRs. The sex chromosomes each comprise pseudoautosomal regions and single sex-specific regions. The Z-specific region contains 932 genes, but on W all but 29 of these genes have been lost and the presence of five pseudogenes indicates that degeneration of W is ongoing. Synteny analysis reveals an ancient chromosomal fusion corresponding to the oldest part of Z, where only a single gene, encoding the large subunit of pre-mRNA splicing factor U2AF, has retained an intact copy on W. The sex-specific copies of U2AF have divergent N-termini and show sex-biased gene expression. Conclusion: Our assembly with fully resolved chromosomes provides evidence of an evolutionary path taken to create the Z and W sex chromosomes of schistosomes. Sex-linked divergence of the single U2AF gene, which has been present in the sex-specific regions longer than any other extant gene with distinct male and female specific copies and expression, may have been a pivotal step in the evolution of gonochorism and genotypic sex determination of schistosomes.

17 citations


Journal Article
TL;DR: This study highlights that intronless genes (IGs) are essential modulators of regulatory processes, such as the Wnt signaling pathway, and of biological processes as pivotal as sensory organ development, at a transcriptional and post-translational level.
Abstract: The structure of eukaryotic genes is generally a combination of exons interrupted by intragenic non-coding DNA regions (introns) removed by RNA splicing to generate the mature mRNA. A fraction of genes, however, comprise a single coding exon with introns in their untranslated regions or are intronless genes (IGs), lacking introns entirely. The latter code for essential proteins involved in development, growth, and cell proliferation and their expression has been proposed to be highly specialized for neuro-specific functions and linked to cancer, neuropathies, and developmental disorders. The abundant presence of introns in eukaryotic genomes is pivotal for the precise control of gene expression. Notwithstanding, IGs exempting splicing events entail a higher transcriptional fidelity, making them even more valuable for regulatory roles. This work aimed to infer the functional role and evolutionary history of IGs centered on the mouse genome. IGs consist of a subgroup of genes with one exon including coding genes, non-coding genes, and pseudogenes, which conform approximately 6% of a total of 21,527 genes. To understand their prevalence, biological relevance, and evolution, we identified and studied 1,116 IG functional proteins validating their differential expression in transcriptomic data of embryonic mouse telencephalon. Our results showed that overall expression levels of IGs are lower than those of MEGs. However, strongly up-regulated IGs include transcription factors (TFs) such as the class 3 of POU (HMG Box), Neurog1, Olig1, and BHLHe22, BHLHe23, among other essential genes including the β-cluster of protocadherins. Most striking was the finding that IG-encoded BHLH TFs fit the criteria to be classified as microproteins. Finally, predicted protein orthologs in other six genomes confirmed high conservation of IGs associated with regulating neural processes and with chromatin organization and epigenetic regulation in Vertebrata. 
Moreover, this study highlights that IGs are essential modulators of regulatory processes, such as the Wnt signaling pathway, and of biological processes as pivotal as sensory organ development, at a transcriptional and post-translational level. Overall, our results suggest that IG proteins have specialized, prevalent, and unique biological roles and that functional divergence between IGs and MEGs is likely to be the result of specific evolutionary constraints.

Journal Article
TL;DR: In this article, the authors outline the creation of a valuable nORFs data set with experimental evidence of translation for the community, use measures of heritability and selection that reveal signals for functional importance, and show the potential implications for functional interpretation of genetic variants in nORFs.
Abstract: Recent evidence from proteomics and deep massively parallel sequencing studies have revealed that eukaryotic genomes contain substantial numbers of as-yet-uncharacterized open reading frames (ORFs). We define these uncharacterized ORFs as novel ORFs (nORFs). nORFs in humans are mostly under 100 codons and are found in diverse regions of the genome, including in long noncoding RNAs, pseudogenes, 3′ UTRs, 5′ UTRs, and alternative reading frames of canonical protein coding exons. There is therefore a pressing need to evaluate the potential functional importance of these unannotated transcripts and proteins in biological pathways and human disease on a larger scale, rather than one at a time. In this study, we outline the creation of a valuable nORFs data set with experimental evidence of translation for the community, use measures of heritability and selection that reveal signals for functional importance, and show the potential implications for functional interpretation of genetic variants in nORFs. Our results indicate that some variants that were previously classified as being benign or of uncertain significance may have to be reinterpreted.

Journal Article
TL;DR: In this paper, the authors discuss the prevalence, molecular mechanisms, and functional evidence for androgen-regulated prostate cancer fusion genes and transcripts and discuss the clinical relevance of especially the most common prostate cancer fusion gene TMPRSS2-ERG.
Abstract: Androgens are steroid hormones governing the male reproductive development and function. As such, androgens and the key mediator of their effects, androgen receptor (AR), have a leading role in many diseases. Prostate cancer is a major disease where AR and its transcription factor function affect a significant number of patients worldwide. While disease-related AR-driven transcriptional programs are connected to the presence and activity of the receptor itself, also novel modes of transcriptional regulation by androgens are exploited by cancer cells. One of the most intriguing and ingenious mechanisms is to bring previously unconnected genes under the control of AR. Most often this occurs through genetic rearrangements resulting in fusion genes where an androgen-regulated promoter area is combined to a protein-coding area of a previously androgen-unaffected gene. These gene fusions are distinctly frequent in prostate cancer compared to other common solid tumors, a phenomenon still requiring an explanation. Interestingly, also another mode of connecting androgen regulation to a previously unaffected gene product exists via transcriptional read-through mechanisms. Furthermore, androgen regulation of fusion genes and transcripts is not linked to only protein-coding genes. Pseudogenes and non-coding RNAs (ncRNAs), including long non-coding RNAs (lncRNAs), can also be affected by androgens and de novo functions produced. In this review, we discuss the prevalence, molecular mechanisms, and functional evidence for androgen-regulated prostate cancer fusion genes and transcripts. We also discuss the clinical relevance of especially the most common prostate cancer fusion gene TMPRSS2-ERG, as well as present open questions of prostate cancer fusions requiring further investigation.

Journal Article
TL;DR: In this paper, the authors developed a method to detect pseudogene sequences in DNA barcoding and metabarcoding analysis using open reading frame length and hidden Markov model (HMM) profile analysis.
Abstract: Pseudogenes are non-functional copies of protein coding genes that typically follow a different molecular evolutionary path as compared to functional genes. The inclusion of pseudogene sequences in DNA barcoding and metabarcoding analysis can lead to misleading results. None of the most widely used bioinformatic pipelines used to process marker gene (metabarcode) high throughput sequencing data specifically accounts for the presence of pseudogenes in protein-coding marker genes. The purpose of this study is to develop a method to screen for nuclear mitochondrial DNA segments (nuMTs) in large COI datasets. We do this by: (1) describing gene and nuMT characteristics from an artificial COI barcode dataset, (2) show the impact of two different pseudogene removal methods on perturbed community datasets with simulated nuMTs, and (3) incorporate a pseudogene filtering step in a bioinformatic pipeline that can be used to process Illumina paired-end COI metabarcode sequences. Open reading frame length and sequence bit scores from hidden Markov model (HMM) profile analysis were used to detect pseudogenes. Our simulations showed that it was more difficult to identify nuMTs from shorter amplicon sequences such as those typically used in metabarcoding compared with full length DNA barcodes that are used in the construction of barcode libraries. It was also more difficult to identify nuMTs in datasets where there is a high percentage of nuMTs. Existing bioinformatic pipelines used to process metabarcode sequences already remove some nuMTs, especially in the rare sequence removal step, but the addition of a pseudogene filtering step can remove up to 5% of sequences even when other filtering steps are in place. Open reading frame length filtering alone or combined with hidden Markov model profile analysis can be used to effectively screen out apparent pseudogenes from large datasets. 
There is more to learn from COI nuMTs such as their frequency in DNA barcoding and metabarcoding studies, their taxonomic distribution, and evolution. Thus, we encourage the submission of verified COI nuMTs to public databases to facilitate future studies.
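
The open reading frame length criterion mentioned in the abstract above can be illustrated with a small example. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the `min_fraction` threshold, and the toy sequences are hypothetical, and the stop codon set assumes the invertebrate mitochondrial genetic code (TAA and TAG as stops). It only scans the three forward frames and does not perform any HMM profile scoring.

```python
# Assumption: invertebrate mitochondrial code, where TAA and TAG are stops.
STOP_CODONS = {"TAA", "TAG"}


def longest_open_stretch(seq):
    """Length (nt) of the longest stop-free run of codons in any forward frame."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0  # a stop codon ends the current open stretch
            else:
                run += 3
                best = max(best, run)
    return best


def flag_putative_pseudogenes(seqs, min_fraction=0.9):
    """Return ids whose longest open stretch covers < min_fraction of the sequence.

    `min_fraction` is an illustrative, hypothetical threshold.
    """
    flagged = []
    for seq_id, seq in seqs.items():
        usable = (len(seq) // 3) * 3  # length rounded down to whole codons
        if usable == 0 or longest_open_stretch(seq) / usable < min_fraction:
            flagged.append(seq_id)
    return flagged


# Toy 21-nt sequences (hypothetical, not real COI data).
toy = {
    "intact": "ATGGCTGGAACTTTATATTTA",  # no in-frame stop in frame 0
    "broken": "ATGGCTTAAACTTAGTATTTA",  # TAA and TAG interrupt frame 0
}
print(flag_putative_pseudogenes(toy))  # prints ['broken']
```

In the toy data the interrupted sequence is still open over 18 of 21 nucleotides in the other two frames, which mirrors the abstract's observation that short amplicons are harder to screen than full-length barcodes.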

Journal Article
01 Feb 2021
TL;DR: In this paper, the authors hypothesized that dysfunctional Siglec-XII, the protein encoded by the SIGLEC12 gene, facilitates human carcinoma progression, correlating with known tumorigenic signatures of Shp2-dependent cancers.
Abstract: Compared with our closest living evolutionary cousins, humans appear unusually prone to develop carcinomas (cancers arising from epithelia). The SIGLEC12 gene, which encodes the Siglec-XII protein expressed on epithelial cells, has several uniquely human features: a fixed homozygous missense mutation inactivating its natural ligand recognition property; a polymorphic frameshift mutation eliminating full-length protein expression in ~60%-70% of worldwide human populations; and, genomic features suggesting a negative selective sweep favoring the pseudogene state. Despite the loss of canonical sialic acid binding, Siglec-XII still recruits Shp2 and accelerates tumor growth in a mouse model. We hypothesized that dysfunctional Siglec-XII facilitates human carcinoma progression, correlating with known tumorigenic signatures of Shp2-dependent cancers. Immunohistochemistry was used to detect Siglec-XII expression on tissue microarrays. PC-3 prostate cancer cells were transfected with Siglec-XII and transcription of genes enriched with Siglec-XII was determined. Genomic SIGLEC12 status was determined for four different cancer cohorts. Finally, a dot blot analysis of human urinary epithelial cells was established to determine the Siglec-XII expressors versus non-expressors. Forced expression in a SIGLEC12 null carcinoma cell line enriched transcription of genes associated with cancer progression. While Siglec-XII was detected as expected in ~30%-40% of normal epithelia, ~80% of advanced carcinomas showed strong expression. Notably, >80% of late-stage colorectal cancers had a functional SIGLEC12 allele, correlating with overall increased mortality. Thus, advanced carcinomas are much more likely to occur in individuals whose genomes have an intact SIGLEC12 gene, likely because the encoded Siglec-XII protein recruits Shp2-related oncogenic pathways. The finding has prognostic, diagnostic, and therapeutic implications.


Journal Article
06 Jun 2021-Genes
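TL;DR: In this article, the authors developed an RNA expression assay targeting the primary transcript of the Translocase of Outer Mitochondria Membrane 40 (TOMM40) gene, avoiding interference from TOMM40 pseudogene RNAs, and showed that TOMM40 RNA was upregulated in Alzheimer's disease (AD) postmortem brain.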
Abstract: Increasing evidence suggests that the Translocase of Outer Mitochondria Membrane 40 (TOMM40) gene may contribute to the risk of Alzheimer's disease (AD). Currently, there is no consensus as to whether TOMM40 expression is up- or down-regulated in AD brains, hindering a clear interpretation of TOMM40's role in this disease. The aim of this study was to determine if TOMM40 RNA levels differ between AD and control brains. We applied RT-qPCR to study TOMM40 transcription in human postmortem brain (PMB) and assessed associations of these RNA levels with genetic variants in APOE and TOMM40. We also compared TOMM40 RNA levels with mitochondrial functions in human cell lines. Initially, we found that the human genome carries multiple TOMM40 pseudogenes capable of producing highly homologous RNAs that can obscure precise TOMM40 RNA measurements. To circumvent this obstacle, we developed a novel RNA expression assay targeting the primary transcript of TOMM40. Using this assay, we showed that TOMM40 RNA was upregulated in AD PMB. Additionally, elevated TOMM40 RNA levels were associated with decreases in mitochondrial DNA copy number and mitochondrial membrane potential in oxidative stress-challenged cells. Overall, differential transcription of TOMM40 RNA in the brain is associated with AD and could be an indicator of mitochondrial dysfunction.

Journal Article
TL;DR: In this paper, the authors reviewed OCT4 known transcripts, isoforms and pseudogenes, as well as its interactions with other proteins, and emphasize the importance of discriminating each of them in order to understand the exact function of OCT4.
Abstract: OCT4 plays critical roles in self-renewal and pluripotency maintenance of embryonic stem cells, and is considered one of the main stemness markers. It also has pivotal roles in early stages of embryonic development. Most studies on OCT4 have focused on the expression and function of OCT4A, which is the biggest isoform of OCT4 known so far. Recently, many studies have shown that OCT4 has various transcript variants, protein isoforms, as well as pseudogenes. Distinguishing the expression and function of these variants and isoforms is a big challenge in expression profiling studies of OCT4. Understanding how OCT4 functions in different contexts depends on knowing where and when each of the OCT4 transcripts, isoforms and pseudogenes is expressed. Here, we review OCT4 known transcripts, isoforms and pseudogenes, as well as its interactions with other proteins, and emphasize the importance of discriminating each of them in order to understand the exact function of OCT4 in stem cells, normal development and development of diseases.

Journal Article
TL;DR: Camellia represents a group, where rDNA is subjected to a mixture of concerted and birth-and-death evolution, and some rRNA pseudogenes may still have potential functions but can be eliminated from the genome when released from selection constraint.

Posted Content
16 Feb 2021-bioRxiv
TL;DR: In this article, the authors compared the performance of the most common alignment tools for scRNA-seq data, including Cell Ranger 5, STARsolo, Kallisto and Alevin, on three published datasets for human and mouse, sequenced with different versions of the 10X sequencing protocol.
Abstract: With the rise of single cell RNA sequencing new bioinformatic tools became available to handle specific demands, such as quantifying unique molecular identifiers and correcting cell barcodes. Here, we analysed several datasets with the most common alignment tools for scRNA-seq data. We evaluated differences in the whitelisting, gene quantification, overall performance and potential variations in clustering or detection of differentially expressed genes. We compared the tools Cell Ranger 5, STARsolo, Kallisto and Alevin on three published datasets for human and mouse, sequenced with different versions of the 10X sequencing protocol. Striking differences have been observed in the overall runtime of the mappers. Besides that, Kallisto and Alevin showed variances in the number of valid cells and detected genes per cell. Kallisto reported the highest number of cells; however, we observed an overrepresentation of cells with low gene content and unknown cell type. Conversely, Alevin rarely reported such low content cells. Further variations were detected in the set of expressed genes. While STARsolo, Cell Ranger 5 and Alevin released similar gene sets, Kallisto detected additional genes from the Vmn and Olfr gene family, which are likely mapping artifacts. We also observed differences in the mitochondrial content of the resulting cells when comparing a prefiltered annotation set to the full annotation set that includes pseudogenes and other biotypes. Overall, this study provides a detailed comparison of common scRNA-seq mappers and shows their specific properties on 10X Genomics data. Key messages: Mapping and gene quantification are the most resource- and time-intensive steps during the analysis of scRNA-seq data. The usage of alternative alignment tools reduces the time for analysing scRNA-seq data. Different mapping strategies influence key properties of scRNA-seq data, e.g. total cell counts or genes per cell. A better understanding of advantages and disadvantages for each mapping algorithm might improve analysis results.
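The mitochondrial-content comparison mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function name `mito_fraction`, the toy counts, and the use of the `MT-` gene-symbol prefix (the usual human convention) to mark mitochondrial genes. The direction of the shift in the toy data is not a result from the paper.

```python
# Illustrative sketch only: per-cell mitochondrial fraction from gene counts.
# Gene counts are invented; "MT-" is the conventional human mitochondrial
# gene-symbol prefix (mouse symbols use "mt-", handled by case folding).

def mito_fraction(counts, prefix="MT-"):
    """Return the fraction of a cell's counts assigned to mitochondrial genes."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mito = sum(
        c for gene, c in counts.items()
        if gene.upper().startswith(prefix.upper())
    )
    return mito / total

# Hypothetical cell quantified against a full annotation that includes a
# nuclear pseudogene of a mitochondrial gene (MTCO1P12), which absorbs reads.
full_annotation_cell = {"MT-CO1": 30, "MTCO1P12": 10, "ACTB": 60}

# The same hypothetical cell quantified against a prefiltered annotation,
# where those reads are assumed to fall on the mitochondrial gene instead.
filtered_annotation_cell = {"MT-CO1": 40, "ACTB": 60}

full_fraction = mito_fraction(full_annotation_cell)
filtered_fraction = mito_fraction(filtered_annotation_cell)
```

In this toy case the choice of annotation alone moves the cell across a typical quality-control threshold on mitochondrial content, which is why the annotation set is worth reporting alongside the mapper.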

Journal ArticleDOI
TL;DR: In this article, the TCR loci were annotated using an improved genome assembly (ARS1) of a highly homozygous San Clemente goat and compared to cattle, finding that TCRγ and TCRδ were similarly organized in goats as in cattle and the gene sequences were highly conserved.
Abstract: Goats and cattle diverged 30 million years ago but retain similarities in immune system genes. Here, the caprine T cell receptor (TCR) gene loci and transcription of its genes were examined and compared to cattle. We annotated the TCR loci using an improved genome assembly (ARS1) of a highly homozygous San Clemente goat. This assembly has already proven useful for describing other immune system genes including antibody and leucocyte receptors. Both the TCRγ (TRG) and TCRδ (TRD) loci were similarly organized in goats as in cattle and the gene sequences were highly conserved. However, the number of genes varied slightly as a result of duplications and differences occurred in mutations resulting in pseudogenes. WC1+ γδ T cells in cattle have been shown to use TCRγ genes from only one of the six available cassettes. The structure of that Cγ gene product is unique and may be necessary to interact with WC1 for signal transduction following antigen ligation. Using RT-PCR and PacBio sequencing, we observed the same restriction for goat WC1+ γδ T cells. In contrast, caprine WC1+ and WC1- γδ T cell populations had a diverse TCRδ gene usage although the propensity for particular gene usage differed between the two cell populations. Noncanonical recombination signal sequences (RSS) largely correlated with restricted expression of TCRγ and δ genes. Finally, caprine γδ T cells were found to incorporate multiple TRD diversity gene sequences in a single transcript, an unusual feature among mammals but also previously observed in cattle.

Journal ArticleDOI
20 Apr 2021-Gene
TL;DR: In this paper, the authors identified the whole set of olfactory receptor genes in representative teleosts and found a significant contraction in common carp when compared with other teleost species.

Journal ArticleDOI
20 Jan 2021-Genes
TL;DR: In this paper, the authors investigated the association of DNA methylation levels with SSc in dermal fibroblasts from 15 SSc patients and 15 controls of African ancestry, and found widespread differential methylation between cases and controls.
Abstract: The etiology and reasons underlying the ethnic disparities in systemic sclerosis (SSc) remain unknown. African Americans are disproportionally affected by SSc and yet are underrepresented in research. The aim of this study was to comprehensively investigate the association of DNA methylation levels with SSc in dermal fibroblasts from patients of African ancestry. Reduced representation bisulfite sequencing (RRBS) was performed on primary dermal fibroblasts from 15 SSc patients and 15 controls of African ancestry, and over 3.8 million CpG sites were tested for differential methylation patterns between cases and controls. The dermal fibroblasts from African American patients exhibited widespread reduced DNA methylation. Differentially methylated CpG sites were most enriched in introns and intergenic regions while depleted in 5′ UTR, promoters, and CpG islands. Seventeen genes and eleven promoters showed significant differential methylation, mostly in non-coding RNA genes and pseudogenes. Gene set enrichment analysis (GSEA) and gene ontology (GO) analyses revealed an enrichment of pathways related to interferon signaling and mesenchymal differentiation. The hypomethylation of DLX5 and TMEM140 was accompanied by these genes’ overexpression in patients but underexpression for lncRNA MGC12916. These data show that differential methylation occurs in dermal fibroblasts from African American patients with SSc and identifies novel coding and non-coding genes.

Journal ArticleDOI
TL;DR: In this article, a set of 95 MHC genomic sequences downloaded from a publicly available BioProject database at NCBI were used to identify and characterise polymorphic human leukocyte antigen (HLA) class I genes and pseudogenes, MICA and MICB, and retroelement indels as haplotypic lineage markers, and single-nucleotide polymorphism (SNP) crossover loci in DNA sequence alignments of different haplotypes across the Olfactory Receptor (OR) gene region (~1.2 Mb) and the MHC class I region (~
Abstract: The genomic region (~4 Mb) of the human major histocompatibility complex (MHC) on chromosome 6p21 is a prime model for the study and understanding of conserved polymorphic sequences (CPSs) and structural diversity of ancestral haplotypes (AHs)/conserved extended haplotypes (CEHs). The aim of this study was to use a set of 95 MHC genomic sequences downloaded from a publicly available BioProject database at NCBI to identify and characterise polymorphic human leukocyte antigen (HLA) class I genes and pseudogenes, MICA and MICB, and retroelement indels as haplotypic lineage markers, and single-nucleotide polymorphism (SNP) crossover loci in DNA sequence alignments of different haplotypes across the Olfactory Receptor (OR) gene region (~1.2 Mb) and the MHC class I region (~1.8 Mb) from the GPX5 to the MICB gene. Our comparative sequence analyses confirmed the identity of 12 haplotypic retroelement markers and revealed that they partitioned the HLA-A/B/C haplotypes into distinct evolutionary lineages. Crossovers between SNP-poor and SNP-rich regions defined the sequence range of haplotype blocks, and many of these crossover junctions occurred within particular transposable elements, lncRNA, OR12D2, MUC21, MUC22, PSORS1A3, HLA-C, HLA-B, and MICA. In a comparison of more than 250 paired sequence alignments, at least 38 SNP-density crossover sites were mapped across various regions from GPX5 to MICB. In a homology comparison of 16 different haplotypes, seven CEH/AH (7.1, 8.1, 18.2, 51.x, 57.1, 62.x, and 62.1) had no detectable SNP-density crossover junctions and were SNP poor across the entire ~2.8 Mb of sequence alignments. Of the analyses between different recombinant haplotypes, more than half of them had SNP crossovers within 10 kb of LTR16B/ERV3-16A3_I, MLT1, Charlie, and/or THE1 sequences and were in close vicinity to structurally polymorphic Alu and SVA insertion sites. 
These studies demonstrate that (1) SNP-density crossovers are associated with putative ancestral recombination sites that are widely spread across the MHC class I genomic region from at least the telomeric OR12D2 gene to the centromeric MICB gene and (2) the genomic sequences of MHC homozygous cell lines are useful for analysing haplotype blocks, ancestral haplotypic landscapes and markers, CPSs, and SNP-density crossover junctions.

Book ChapterDOI
TL;DR: In this paper, the authors describe pseudogenes that fulfill their role as diagnostic or prognostic biomarkers, both as unique elements and in collaboration with other genes or pseudogenes.
Abstract: Pseudogenes are commonly labeled as "junk DNA" given their perceived nonfunctional status. However, the advent of large-scale genomics projects prompted a revisit of pseudogene biology, highlighting their key functional and regulatory roles in numerous diseases, including cancers. Integrative analyses of cancer data have shown that pseudogenes can be transcribed and even translated, and that pseudogenic DNA, RNA, and proteins can interfere with the activity and function of key protein coding genes, acting as regulators of oncogenes and tumor suppressors. Capitalizing on the available clinical research, we are able to get an insight into the spread and variety of pseudogene biomarker and therapeutic potential. In this chapter, we describe pseudogenes that fulfill their role as diagnostic or prognostic biomarkers, both as unique elements and in collaboration with other genes or pseudogenes. We also report that the majority of prognostic pseudogenes are overexpressed and exert an oncogenic role in colorectal, liver, lung, and gastric cancers. Finally, we highlight a number of pseudogenes that can establish future therapeutic avenues.

Journal ArticleDOI
TL;DR: Molecular genetic analysis of CYP21A2 variants is of major importance for confirming the clinical diagnosis, predicting prognosis, and providing appropriate genetic counselling; this review presents an update on the analysis of CYP21A2 variants in CAH patients performed in the authors' department.
Abstract: Congenital Adrenal Hyperplasia is a group of genetic autosomal recessive disorders that affects adrenal steroidogenesis in the adrenal cortex. One of the most common defects associated with Congenital Adrenal Hyperplasia is the deficiency of 21-hydroxylase enzyme, responsible for the conversion of 17-hydroxyprogesterone to 11-deoxycortisol and progesterone to deoxycorticosterone. The impairment of cortisol and aldosterone production is directly related to the clinical form of the disease that ranges from classic or severe to non-classic or mild late onset. The deficiency of 21-hydroxylase enzyme results from pathogenic variants on CYP21A2 gene that, in the majority of the cases, compromise enzymatic activity and are strongly correlated with the clinical severity of the disease. Due to the exceptionally high homology and proximity between the gene and the pseudogene, more than 90% of pathogenic variants result from intergenic recombination. Around 75% are deleterious variants transferred from the pseudogene by gene conversion, during mitosis. About 20% are due to unequal crossing over during meiosis and lead to duplications or deletions on CYP21A2 gene. Molecular genetic analysis of CYP21A2 variants is of major importance for confirmation of clinical diagnosis, predicting prognosis and for an appropriate genetic counselling. In this review we will present an update on the genetic analysis of CYP21A2 gene variants in CAH patients performed in our department.

Posted ContentDOI
08 Oct 2021-bioRxiv
TL;DR: Pseudofinder is an open-source software tool dedicated to pseudogene identification and analysis; it can detect a wide variety of pseudogenes, including those that are highly degraded and typically missed by gene-calling pipelines, as well as newly formed pseudogenes, which can have only one or a few inactivating mutations.
Abstract: Prokaryotic genomes are generally gene dense and encode relatively few pseudogenes, or nonfunctional/inactivated remnants of genes. However, in certain contexts, such as recent ecological shifts or extreme population bottlenecks (such as those experienced by symbionts and pathogens), pseudogenes can quickly accumulate and form a substantial fraction of the genome. Identification of pseudogenes is, thus, a critical step for understanding the evolutionary forces acting upon, and the functional potential encoded within, prokaryotic genomes. Here, we present Pseudofinder, an open-source software dedicated to pseudogene identification and analysis. With Pseudofinder's multi-pronged, reference-based approach, we demonstrate its capacity to detect a wide variety of pseudogenes, including those that are highly degraded and typically missed by gene-calling pipelines, as well as newly formed pseudogenes, which can have only one or a few inactivating mutations. Additionally, Pseudofinder can detect intact genes undergoing relaxed selection, which may indicate incipient pseudogene formation. Implementation of Pseudofinder in annotation pipelines will not only clarify the functional potential of sequenced microbes, but will also generate novel insights and hypotheses regarding the evolutionary dynamics of bacterial and archaeal genomes.

Journal ArticleDOI
TL;DR: In this paper, the authors introduce a more sensitive and accurate method for identifying processed pseudogenes that uses long-read assemblies, in contrast to prior surveys that mostly relied on detecting discordant mappings of paired-end short reads or exon junctions contained in short reads.
Abstract: LINE-1-mediated retrotransposition of protein-coding mRNAs is an active process in modern humans for both germline and somatic genomes. Prior works that surveyed human data mostly relied on detecting discordant mappings of paired-end short reads, or exon junctions contained in short reads. Moreover, there have been few genome-wide comparisons between gene retrocopies in great apes and humans. In this study, we introduced a more sensitive and accurate method to identify processed pseudogenes. Our method utilizes long-read assemblies, and more importantly, is able to provide full-length retrocopy sequences as well as flanking regions which are missed by short-read based methods. From 22 human individuals, we pinpointed 40 processed pseudogenes that are not present in the human reference genome GRCh38 and identified 17 pseudogenes that are in GRCh38 but absent from some input individuals. This represents a significantly higher discovery rate than previous reports (39 pseudogenes not in the reference genome out of 939 individuals). We also provided an overview of lineage-specific retrocopies in chimpanzee, gorilla, and orangutan genomes.
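The discovery-rate comparison in the abstract can be made concrete with a short calculation. The abstract does not say how the rate is defined, so normalising by the number of individuals surveyed is an assumption here, and the helper name `per_individual_rate` is hypothetical. Because non-reference insertions are often shared between individuals, the number found is not expected to grow linearly with cohort size, so the ratio below is only a rough guide.

```python
# Illustrative arithmetic only: non-reference processed pseudogenes found
# per individual surveyed. Counts are taken from the abstract; the
# per-individual normalisation is an assumption.

def per_individual_rate(n_found, n_individuals):
    """Number of non-reference pseudogenes found per individual surveyed."""
    return n_found / n_individuals

this_study = per_individual_rate(40, 22)        # long-read assemblies
previous_reports = per_individual_rate(39, 939)  # earlier short-read surveys

# How many times higher the per-individual rate is in this study.
fold_difference = this_study / previous_reports

print(f"this study:       {this_study:.2f} per individual")
print(f"previous reports: {previous_reports:.4f} per individual")
print(f"fold difference:  {fold_difference:.1f}")
```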

Posted ContentDOI
26 Jan 2021-bioRxiv
TL;DR: In this article, the authors describe gene and pseudogene characteristics from a simulated DNA barcode dataset, show the impact of two different pseudogene removal methods on mock metabarcode datasets with simulated pseudogenes, and incorporate a pseudogene filtering step in a bioinformatic pipeline that can be used to process Illumina paired-end COI metabarcode sequences.
Abstract: Background: Pseudogenes are non-functional copies of protein coding genes that typically follow a different molecular evolutionary path as compared to functional genes. The inclusion of pseudogene sequences in DNA barcoding and metabarcoding analysis can lead to misleading results. None of the most widely used bioinformatic pipelines used to process marker gene (metabarcode) high throughput sequencing data specifically accounts for the presence of pseudogenes in protein-coding marker genes. The purpose of this study is to develop a method to screen for obvious pseudogenes in large COI metabarcode datasets. We do this by: 1) describing gene and pseudogene characteristics from a simulated DNA barcode dataset, 2) showing the impact of two different pseudogene removal methods on mock metabarcode datasets with simulated pseudogenes, and 3) incorporating a pseudogene filtering step in a bioinformatic pipeline that can be used to process Illumina paired-end COI metabarcode sequences. Open reading frame length and sequence bit scores from hidden Markov model (HMM) profile were used to detect pseudogenes. Results: Our simulations showed that it was more difficult to identify pseudogenes from shorter amplicon sequences such as those typically used in metabarcoding (∼300 bp) compared with full length DNA barcodes that are used in construction of barcode libraries (∼650 bp). It was also more difficult to identify pseudogenes in datasets where there is a high percentage of pseudogene sequences. We show that existing bioinformatic pipelines used to process metabarcode sequences already remove some apparent pseudogenes, especially in the rare sequence removal step, but the addition of a pseudogene filtering step can remove more. Conclusions: The combination of open reading frame length and hidden Markov model profile analysis can be used to effectively screen out obvious pseudogenes from large datasets. 
There is more to learn from COI pseudogenes such as their frequency in DNA barcode and metabarcoding studies, their taxonomic distribution, and evolution. Thus, we encourage the submission of verified COI pseudogenes to public databases to facilitate future studies.
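The open-reading-frame part of the screen described above can be sketched as follows. This is not the authors' pipeline: the function names, the fixed 0.9 length fraction, and the toy sequences are all assumptions for illustration, and the HMM bit-score step is omitted. Stop codons follow the invertebrate mitochondrial genetic code (NCBI translation table 5), in which TAA and TAG terminate translation; other taxa would need a different table.

```python
# Illustrative sketch only: flag amplicons whose longest stop-free stretch
# is short relative to the sequence length. The threshold is an assumption.

STOP_CODONS = {"TAA", "TAG"}  # invertebrate mitochondrial code (table 5)

def longest_orf_length(seq):
    """Longest stop-free run, in nucleotides, over the three forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                best = max(best, run)
                run = 0  # a stop codon ends the current open stretch
            else:
                run += 3
        best = max(best, run)
    return best

def flag_putative_pseudogenes(seqs, min_fraction=0.9):
    """Return names of sequences whose longest ORF covers too little of them."""
    return [
        name for name, seq in seqs.items()
        if longest_orf_length(seq) < min_fraction * len(seq)
    ]

# Toy sequences (invented): one open in frame 0, one with in-frame stops.
toy_amplicons = {
    "intact": "ATGGCTGCTGCTGCTGCT",
    "broken": "GCTTAAGTAAGGTAGGCT",
}
flagged = flag_putative_pseudogenes(toy_amplicons)
```

A fixed fraction is the simplest possible rule. Short amplicons leave little room for a stop codon to appear by chance, which matches the abstract's observation that pseudogenes are harder to detect in roughly 300 bp metabarcodes than in full-length barcodes.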

Journal ArticleDOI
TL;DR: This study provides an insight into the redundancy of CAZymes for potential cold adaptation and suggests that the isolate PAMC25564 has glycogen, trehalose, and maltodextrin pathways associated with CAZyme genes, which allow Arthrobacter species to establish a symbiotic relationship with other bacteria in cold environments or live independently.
Abstract: The Arthrobacter group is a known set of bacteria from cold regions, the species of which are highly likely to play diverse roles at low temperatures. However, their survival mechanisms in cold regions such as Antarctica are not yet fully understood. In this study, we compared the genomes of 16 strains within the Arthrobacter group, including strain PAMC25564, to identify genomic features that help it to survive in the cold environment. Using 16S rRNA sequence analysis, we found and identified a species of Arthrobacter isolated from cryoconite. We designated it as strain PAMC25564 and elucidated its complete genome sequence. The genome of PAMC25564 is composed of a circular chromosome of 4,170,970 bp with a GC content of 66.74% and is predicted to include 3,829 genes of which 3,613 are protein coding, 147 are pseudogenes, 15 are rRNA coding, and 51 are tRNA coding. In addition, we provide insight into the redundancy of the genes using comparative genomics and suggest that PAMC25564 has glycogen and trehalose metabolism pathways (biosynthesis and degradation) associated with carbohydrate-active enzymes (CAZymes). We also explain how PAMC25564 produces energy in an extreme environment, wherein it utilizes polysaccharide or carbohydrate degradation as a source of energy. The genetic pattern analysis of CAZymes in cold-adapted bacteria can help to determine how they adapt and survive in such environments. We have characterized the complete Arthrobacter sp. PAMC25564 genome and used comparative analysis to provide insight into the redundancy of its CAZymes for potential cold adaptation. This provides a foundation for understanding how the Arthrobacter strain produces energy in an extreme environment, which is by way of CAZymes, consistent with reports on the use of these specialized enzymes in cold environments. 
Knowledge of glycogen metabolism and cold adaptation mechanisms in Arthrobacter species may promote in-depth research and subsequent application in low-temperature biotechnology.