scispace - formally typeset
Search or ask a question
Author

Christian J. Michel

Bio: Christian J. Michel is an academic researcher from University of Strasbourg. The author has contributed to research in topics: Genetic code & Genome. The author has an hindex of 25, co-authored 96 publications receiving 1786 citations. Previous affiliations of Christian J. Michel include Centre national de la recherche scientifique & University of Franche-Comté.
Topics: Genetic code, Genome, Population, Gene, Transfer RNA


Papers
More filters
Journal ArticleDOI
TL;DR: Surprisingly, the same three subsets of trinucleotides are identified in these two gene populations: Tzero, T1 and T2 replaced in the three two-letter genetic alphabets purine/pyrimidine, amino/ceto and strong/weak interaction.

185 citations

Journal ArticleDOI
TL;DR: The analysis provides evidence supporting the presence of overlapping ORFs 7b, 9b and 9c in all the genomes and thus helps to resolve some differences in current genome annotations and predicts that ORF3b is not functional in all genomes.
Abstract: The Covid19 infection is caused by the SARS-CoV-2 virus, a novel member of the coronavirus (CoV) family. CoV genomes code for a ORF1a / ORF1ab polyprotein and four structural proteins widely studied as major drug targets. The genomes also contain a variable number of open reading frames (ORFs) coding for accessory proteins that are not essential for virus replication, but appear to have a role in pathogenesis. The accessory proteins have been less well characterized and are difficult to predict by classical bioinformatics methods. We propose a computational tool GOFIX to characterize potential ORFs in virus genomes. In particular, ORF coding potential is estimated by searching for enrichment in motifs of the X circular code, that is known to be over-represented in the reading frames of viral genes. We applied GOFIX to study the SARS-CoV-2 and related genomes including SARS-CoV and SARS-like viruses from bat, civet and pangolin hosts, focusing on the accessory proteins. Our analysis provides evidence supporting the presence of overlapping ORFs 7b, 9b and 9c in all the genomes and thus helps to resolve some differences in current genome annotations. In contrast, we predict that ORF3b is not functional in all genomes. Novel putative ORFs were also predicted, including a truncated form of the ORF10 previously identified in SARS-CoV-2 and a little known ORF overlapping the Spike protein in Civet-CoV and SARS-CoV. Our findings contribute to characterizing sequence properties of accessory genes of SARS coronaviruses, and especially the newly acquired genes making use of overlapping reading frames.

131 citations

Journal ArticleDOI
TL;DR: These identified X circular code motifs and their mathematical properties may constitute a translation code involved in retrieval, maintenance and synchronization of reading frames in genes.

77 citations

Posted ContentDOI
30 May 2020-bioRxiv
TL;DR: The GOFIX analysis provides evidence supporting the presence of overlapping ORFs 7b, 9b and 9c in all the genomes and thus helps to resolve some differences in current genome annotations, and predicts that ORF3b is not functional in all genomes.
Abstract: The Covid19 infection is caused by the SARS-CoV-2 virus, a novel member of the coronavirus (CoV) family. CoV genomes code for a ORF1a / ORF1ab polyprotein and four structural proteins widely studied as major drug targets. The genomes also contain a variable number of open reading frames (ORFs) coding for accessory proteins that are not essential for virus replication, but appear to have a role in pathogenesis. The accessory proteins have been less well characterized and are difficult to predict by classical bioinformatics methods. We propose a computational tool GOFIX to characterize potential ORFs in virus genomes. In particular, ORF coding potential is estimated by searching for enrichment in motifs of the X circular code, that is known to be over-represented in the reading frames of viral genes. We applied GOFIX to study the SARS-CoV-2 and related genomes including SARS-CoV and SARS-like viruses from bat, civet and pangolin hosts, focusing on the accessory proteins. Our analysis provides evidence supporting the presence of overlapping ORFs 7b, 9b and 9c in all the genomes and thus helps to resolve some differences in current genome annotations. In contrast, we predict that ORF3b is not functional in all genomes. Novel putative ORFs were also predicted, including a truncated form of the ORF10 previously identified in SARS-CoV-2 and a little known ORF overlapping the Spike protein in Civet-CoV and SARS-CoV. Our findings contribute to characterizing sequence properties of accessory genes of SARS coronaviruses, and especially the newly acquired genes making use of overlapping reading frames.

62 citations

Journal ArticleDOI
TL;DR: The identification of X motifs and a gene circular code property in tRNAs strengthens the concept proposed in Michel (2012) of a possible translation (framing) code based on a circular code.

59 citations


Cited by
More filters
Journal ArticleDOI

3,734 citations

Journal ArticleDOI
TL;DR: Simulation analyses show that the character trimming performed by BMGE produces datasets leading to accurate trees, especially with alignments including distantly-related sequences.
Abstract: The quality of multiple sequence alignments plays an important role in the accuracy of phylogenetic inference. It has been shown that removing ambiguously aligned regions, but also other sources of bias such as highly variable (saturated) characters, can improve the overall performance of many phylogenetic reconstruction methods. A current scientific trend is to build phylogenetic trees from a large number of sequence datasets (semi-)automatically extracted from numerous complete genomes. Because these approaches do not allow a precise manual curation of each dataset, there exists a real need for efficient bioinformatic tools dedicated to this alignment character trimming step. Here is presented a new software, named BMGE (Block Mapping and Gathering with Entropy), that is designed to select regions in a multiple sequence alignment that are suited for phylogenetic inference. For each character, BMGE computes a score closely related to an entropy value. Calculation of these entropy-like scores is weighted with BLOSUM or PAM similarity matrices in order to distinguish among biologically expected and unexpected variability for each aligned character. Sets of contiguous characters with a score above a given threshold are considered as not suited for phylogenetic inference and then removed. Simulation analyses show that the character trimming performed by BMGE produces datasets leading to accurate trees, especially with alignments including distantly-related sequences. BMGE also implements trimming and recoding methods aimed at minimizing phylogeny reconstruction artefacts due to compositional heterogeneity. BMGE is able to perform biologically relevant trimming on a multiple alignment of DNA, codon or amino acid sequences. Java source code and executable are freely available at ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/ .

1,080 citations

Journal ArticleDOI
TL;DR: This paper reviews and synthesizes the underlying coding measures from published algorithms and concludes that a very simple and obvious measure--counting oligomers--is more effective than any of the more sophisticated measures.
Abstract: A number of methods for recognizing protein coding genes in DNA sequence have been published over the last 13 years, and new, more comprehensive algorithms, drawing on the repertoire of existing techniques, continue to be developed. To optimize continued development, it is valuable to systematically review and evaluate published techniques. At the core of most gene recognition algorithms is one or more coding measures--functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of 'typical' exonic DNA. In this paper we review and synthesize the underlying coding measures from published algorithms. A standardized benchmark is described, and each of the measures is evaluated according to this benchmark. Our main conclusion is that a very simple and obvious measure--counting oligomers--is more effective than any of the more sophisticated measures. Different measures contain different information. However there is a great deal of redundancy in the current suite of measures. We show that in future development of gene recognition algorithms, attention can probably be limited to six of the twenty or so measures proposed to date.

407 citations

Journal Article
TL;DR: A compositional bias is identified as responsible for this inconsistency and it is reduced effectively by coding the nucleotides as purines and pyrimidines (RY-coding), reinforcing the original tree.
Abstract: Phylogenetic inference from sequences can be misled by both sampling (stochastic) error and systematic error (nonhistorical signals where reality differs from our simplified models). A recent study of eight yeast species using 106 concatenated genes from complete genomes showed that even small internal edges of a tree received 100% bootstrap support. This effective negation of stochastic error from large data sets is important, but longer sequences exacerbate the potential for biases (systematic error) to be positively misleading. Indeed, when we analyzed the same data set using minimum evolution optimality criteria, an alternative tree received 100% bootstrap support. We identified a compositional bias as responsible for this inconsistency and showed that it is reduced effectively by coding the nucleotides as purines and pyrimidines (RY-coding), reinforcing the original tree. Thus, a comprehensive exploration of potential systematic biases is still required, even though genome-scale data sets greatly reduce sampling error.

392 citations

Journal ArticleDOI
TL;DR: A lossless algorithm is presented, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities, such as the presence of palindromes, which leads to the highest compression of DNA.
Abstract: Universal data compression algorithms fail to compress genetic sequences. It is due to the specificity of this particular kind of “text.” We analyze in some detail the properties of the sequences, which cause the failure of classical algorithms. We then present a lossless algorithm, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities, such as the presence of palindromes. The algorithm combines substitutional and statistical methods, and to the best of our knowledge, leads to the highest compression of DNA. The results, although not satisfactory, give insight to the necessary correlation between compression and comprehension of genetic sequences.

302 citations