Author

Hélène Touzet

Bio: Hélène Touzet is an academic researcher from the University of Lille. The author has contributed to research in the topics of edit distance and RNA. The author has an h-index of 19, has co-authored 59 publications, and has received 2,399 citations. Previous affiliations of Hélène Touzet include the Laboratoire d'Informatique Fondamentale de Lille and the Centre national de la recherche scientifique.


Papers
Journal ArticleDOI
TL;DR: SortMeRNA, new software designed to rapidly filter rRNA fragments from metatranscriptomic data, is presented. It is capable of handling large sets of reads and sorting out all fragments matching the rRNA database with high sensitivity and low running time.
Abstract: MOTIVATION: The application of Next-Generation Sequencing (NGS) technologies to RNAs directly extracted from a community of organisms yields a mixture of fragments characterizing both coding and non-coding types of RNAs. The task of distinguishing among these, and of further categorizing the families of messenger RNAs and ribosomal RNAs, is an important step in examining the gene expression patterns of an interactive environment and the phylogenetic classification of the constituting species. RESULTS: We present SortMeRNA, new software designed to rapidly filter ribosomal RNA fragments from metatranscriptomic data. It is capable of handling large sets of reads and sorting out all fragments matching the rRNA database with high sensitivity and low running time. AVAILABILITY: http://bioinfo.lifl.fr/RNA/sortmerna CONTACT: evguenia.kopylova@lifl.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
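The filtering idea in the abstract can be illustrated with a deliberately simplified sketch. This is not SortMeRNA's actual algorithm (which relies on an indexed reference database and approximate seed matching); the function names, the k-mer length, and the hit threshold below are invented illustration values.

```python
# Toy reference-based read filter: a read is flagged as rRNA-like when it
# shares at least `min_hits` distinct k-mers with the reference index.
# (Illustration only -- NOT the SortMeRNA algorithm.)

def kmers(seq, k):
    """Return the set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references, k=8):
    """Union of k-mers over all reference rRNA sequences."""
    index = set()
    for ref in references:
        index |= kmers(ref, k)
    return index

def filter_reads(reads, index, k=8, min_hits=3):
    """Split reads into (rrna_like, other) by shared k-mer count."""
    rrna, other = [], []
    for read in reads:
        hits = sum(1 for km in kmers(read, k) if km in index)
        (rrna if hits >= min_hits else other).append(read)
    return rrna, other
```

For example, a read that is a substring of a reference shares many k-mers with the index and is routed to the rRNA bin, while an unrelated read shares none.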

1,868 citations

Journal ArticleDOI
TL;DR: A novel algorithm is described that achieves better performance in terms of computational time and precision than existing tools, and is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values.
Abstract: Background: Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. Using PWMs requires, as a prerequisite, knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model achieves a score larger than or equal to the observed value. This gives rise to the following problem: given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time. Results: The contribution of this paper is twofold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in software called TFM-PVALUE, which is freely available. Conclusion: We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.
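The score-distribution computation that the abstract builds on can be sketched as follows. This is the classical dynamic-programming baseline under an i.i.d. background model, not the TFM-PVALUE algorithm itself, which refines it with a series of progressively finer discretized distributions; the example matrix and background below are invented.

```python
# Exact score distribution of a PWM under an i.i.d. background model,
# computed column by column by dynamic programming (convolution).
# Illustration of the classical baseline, not TFM-PVALUE.

from collections import defaultdict

def score_distribution(pwm, background):
    """pwm: list of {letter: score} columns; background: {letter: prob}.
    Returns {total_score: probability} for a random background word."""
    dist = {0: 1.0}
    for column in pwm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for letter, p in background.items():
                nxt[score + column[letter]] += prob * p
        dist = dict(nxt)
    return dist

def p_value(pwm, background, threshold):
    """P(score >= threshold), i.e. the P-value of the threshold."""
    dist = score_distribution(pwm, background)
    return sum(p for s, p in dist.items() if s >= threshold)
```

For a two-column matrix scoring 1 for A and 0 otherwise, over a uniform background, `p_value(pwm, bg, 2)` is 1/16 and `p_value(pwm, bg, 1)` is 7/16. The running time grows with the number of distinct reachable scores, which is exactly why non-integer coefficients make the naive approach expensive and motivate the discretization scheme of the paper.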

129 citations

Journal ArticleDOI
TL;DR: The six function classes defined turn out to be exactly the deterministic (resp. non-deterministic) polynomial time, linear exponential time and linear doubly exponential time computable functions; it is further demonstrated that functions with exponential interpretation termination proofs are super-elementary.
Abstract: We study the effect of polynomial interpretation termination proofs of deterministic (resp. non-deterministic) algorithms defined by confluent (resp. non-confluent) rewrite systems over data structures which include strings, lists and trees, and we classify them according to the interpretations of the constructors. This leads to the definition of six function classes which turn out to be exactly the deterministic (resp. non-deterministic) polynomial time, linear exponential time and linear doubly exponential time computable functions when the class is based on confluent (resp. non-confluent) rewrite systems. We also obtain a characterisation of the linear space computable functions. Finally, we demonstrate that functions with exponential interpretation termination proofs are super-elementary.
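As a standard textbook illustration of a polynomial interpretation termination proof (not an example taken from the paper), consider the rewrite system for addition over unary numerals:

```latex
\begin{align*}
&\text{Rules:} && add(0, y) \to y, \qquad add(s(x), y) \to s(add(x, y))\\
&\text{Interpretation:} && [0] = 1, \qquad [s](x) = x + 1, \qquad [add](x, y) = 2x + y\\
&\text{Strict decrease:} && [add(0, y)] = y + 2 > y = [y],\\
& && [add(s(x), y)] = 2x + y + 2 > 2x + y + 1 = [s(add(x, y))].
\end{align*}
```

Every rule application strictly decreases the interpreted value over the natural numbers, so the system terminates. The shape of the constructor interpretations (here the additive $[s](x) = x + 1$) is exactly the kind of constraint the paper uses to delineate its function classes.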

92 citations

Journal ArticleDOI
TL;DR: This work presents a tool for the prediction of conserved secondary structure elements of a family of homologous non-coding RNAs that successfully applies to datasets with low primary structure similarity and does not require any prior multiple sequence alignment.
Abstract: We present a tool for the prediction of conserved secondary structure elements of a family of homologous non-coding RNAs. Our method does not require any prior multiple sequence alignment. Thus, it successfully applies to datasets with low primary structure similarity. The functionality is demonstrated using three example datasets: sequences of RNase P RNAs, ciliate telomerases and enterovirus messenger RNAs. CARNAC has a web server that can be accessed at the URL http://bioinfo.lifl.fr/carnac.

78 citations

Journal ArticleDOI
TL;DR: It is shown that CARNAC provides a good partial prediction for a wide range of sequences and, in the presence of a whole family of sequences, can be used to detect whether the sequences actually share the same structure.
Abstract: Motivation: CARNAC is a new method for pairwise folding of RNA sequences. The program takes into account local similarity, stem energy, and covariations to produce the common folding. It can handle all RNA types, and has also been adapted to align a new homologous sequence along a reference structured sequence. Results: Using different data sets, we show that CARNAC provides a good partial prediction for a wide range of sequences (16S ssu rRNA, RNase P RNA, viruses) with only two sequences. In the presence of a whole family of sequences, we also show that CARNAC can be used to detect whether the sequences actually share the same structure. Availability: CARNAC is available at the URL http://www.lifl.fr/~perrique/rna/

67 citations


Cited by
01 Jun 2012
TL;DR: SPAdes, a new assembler for both single-cell and standard (multicell) assembly, is presented; it improves on the recently released E+V-SC assembler and on the popular assemblers Velvet and SoapDeNovo (for multicell data).
Abstract: The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades). It is distributed as open source software.

10,124 citations

Journal ArticleDOI
05 May 2011-Nature
TL;DR: This study presents a general framework for deciphering cis-regulatory connections and their roles in disease, and maps nine chromatin marks across nine cell types to systematically characterize regulatory elements, their cell-type specificities and their functional interactions.
Abstract: Chromatin profiling has emerged as a powerful means of genome annotation and detection of regulatory activity. The approach is especially well suited to the characterization of non-coding portions of the genome, which critically contribute to cellular phenotypes yet remain largely uncharted. Here we map nine chromatin marks across nine cell types to systematically characterize regulatory elements, their cell-type specificities and their functional interactions. Focusing on cell-type-specific patterns of promoters and enhancers, we define multicell activity profiles for chromatin state, gene expression, regulatory motif enrichment and regulator expression. We use correlations between these profiles to link enhancers to putative target genes, and predict the cell-type-specific activators and repressors that modulate them. The resulting annotations and regulatory predictions have implications for the interpretation of genome-wide association studies. Top-scoring disease single nucleotide polymorphisms are frequently positioned within enhancer elements specifically active in relevant cell types, and in some cases affect a motif instance for a predicted regulator, thus suggesting a mechanism for the association. Our study presents a general framework for deciphering cis-regulatory connections and their roles in disease.
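The profile-correlation step described above can be reduced to a toy computation: an enhancer is linked to a putative target gene when their activity profiles across cell types correlate strongly. The function names, profiles and the 0.9 cutoff below are invented for illustration; the study's actual linking combines richer multicell activity profiles for chromatin state, expression, motif enrichment and regulator expression.

```python
# Toy correlation-based enhancer-gene linking across cell types.
# (Illustration of the idea only -- not the study's pipeline.)

from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length activity profiles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def link_enhancers(enhancers, genes, cutoff=0.9):
    """enhancers/genes: {name: activity profile across cell types}.
    Returns (enhancer, gene, r) triples with r >= cutoff."""
    return [(e, g, pearson(pe, pg))
            for e, pe in enhancers.items()
            for g, pg in genes.items()
            if pearson(pe, pg) >= cutoff]
```

An enhancer active in the same cell types as a gene (profiles rising and falling together) correlates near +1 and is linked; anti-correlated or unrelated pairs fall below the cutoff.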

2,646 citations

Journal ArticleDOI
TL;DR: The results illustrate the importance of parameter tuning for optimizing classifier performance, and recommendations are made regarding parameter choices for these classifiers under a range of standard operating conditions.
Abstract: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. We present q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a QIIME 2 plugin containing several novel machine-learning and alignment-based methods for taxonomy classification. We evaluated and optimized several commonly used classification methods implemented in QIIME 1 (RDP, BLAST, UCLUST, and SortMeRNA) and several new methods implemented in QIIME 2 (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods based on VSEARCH and BLAST+) for classification of bacterial 16S rRNA and fungal ITS marker-gene amplicon sequence data. The naive-Bayes, BLAST+-based, and VSEARCH-based classifiers implemented in QIIME 2 meet or exceed the species-level accuracy of other commonly used methods designed for classification of marker gene sequences that were evaluated in this work. These evaluations, based on 19 mock communities and error-free sequence simulations, including classification of simulated “novel” marker-gene sequences, are available in our extensible benchmarking framework, tax-credit (https://github.com/caporaso-lab/tax-credit-data). Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for these classifiers under a range of standard operating conditions. q2-feature-classifier and tax-credit are both free, open-source, BSD-licensed packages available on GitHub.
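To make the naive Bayes idea concrete, here is a toy k-mer taxonomy classifier in the same spirit as (but far simpler than) the scikit-learn classifier wrapped by q2-feature-classifier; the taxa, training sequences, k-mer length and smoothing parameter are all invented.

```python
# Toy multinomial naive Bayes over k-mer counts, with Laplace smoothing.
# (Illustration only -- not the q2-feature-classifier implementation.)

from collections import Counter
from math import log

def kmer_counts(seq, k=4):
    """Multiset of overlapping k-mers of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def train(examples, k=4):
    """examples: {taxon: [sequences]} -> per-taxon pooled k-mer counts."""
    return {t: sum((kmer_counts(s, k) for s in seqs), Counter())
            for t, seqs in examples.items()}

def classify(seq, model, k=4, alpha=1.0):
    """Return the taxon maximizing the smoothed log-likelihood of seq."""
    def loglik(counts):
        total = sum(counts.values())
        vocab = 4 ** k          # number of possible DNA k-mers
        return sum(n * log((counts[km] + alpha) / (total + alpha * vocab))
                   for km, n in kmer_counts(seq, k).items())
    return max(model, key=lambda t: loglik(model[t]))
```

A query sequence is assigned to the taxon whose pooled k-mer profile best explains its own k-mer counts, which is the essence of the marker-gene naive Bayes approach the paper benchmarks.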

2,475 citations

Journal ArticleDOI
TL;DR: RegulomeDB, a novel approach and database that guides interpretation of regulatory variants in the human genome, is presented; it combines high-throughput experimental data sets from ENCODE and other sources with computational predictions and manual annotations to identify putative regulatory potential and functional variants.
Abstract: As the sequencing of healthy and disease genomes becomes more commonplace, detailed annotation provides interpretation for individual variation responsible for normal and disease phenotypes. Current approaches focus on direct changes in protein coding genes, particularly nonsynonymous mutations that directly affect the gene product. However, most individual variation occurs outside of genes and, indeed, most markers generated from genome-wide association studies (GWAS) identify variants outside of coding segments. Identification of potential regulatory changes that perturb these sites will lead to a better localization of truly functional variants and interpretation of their effects. We have developed a novel approach and database, RegulomeDB, which guides interpretation of regulatory variants in the human genome. RegulomeDB includes high-throughput, experimental data sets from ENCODE and other sources, as well as computational predictions and manual annotations to identify putative regulatory potential and identify functional variants. These data sources are combined into a powerful tool that scores variants to help separate functional variants from a large pool and provides a small set of putative sites with testable hypotheses as to their function. We demonstrate the applicability of this tool to the annotation of noncoding variants from 69 full sequenced genomes as well as that of a personal genome, where thousands of functionally associated variants were identified. Moreover, we demonstrate a GWAS where the database is able to quickly identify the known associated functional variant and provide a hypothesis as to its function. Overall, we expect this approach and resource to be valuable for the annotation of human genome sequences.

2,355 citations

01 Jan 2011
TL;DR: The sheer volume and scope of this flood of data pose a significant challenge to the development of efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data.
Abstract: Rapid improvements in sequencing and array-based platforms are resulting in a flood of diverse genome-wide data, including data from exome and whole-genome sequencing, epigenetic surveys, expression profiling of coding and noncoding RNAs, single nucleotide polymorphism (SNP) and copy number profiling, and functional assays. Analysis of these large, diverse data sets holds the promise of a more comprehensive understanding of the genome and its relation to human disease. Experienced and knowledgeable human review is an essential component of this process, complementing computational approaches. This calls for efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data. However, the sheer volume and scope of data pose a significant challenge to the development of such tools.

2,187 citations