Author

Yonil Park

Bio: Yonil Park is an academic researcher from the National Institute of Advanced Industrial Science and Technology. The author has contributed to research on the topics Alignment-free sequence analysis and Frameshift mutation. The author has an h-index of 1 and has co-authored 1 publication receiving 26 citations.

Papers
Journal Article
TL;DR: A method to estimate the statistical significance of frameshift alignments, similar to classic BLAST statistics, is described. Results suggest that metagenomic analysis needs to use frameshift alignment to derive accurate results.
Abstract: Motivation: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. Results: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two ‘post-genomic’ applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results. Availability and implementation: The statistical calculation is available in FALP (http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/index/software.html), and giga-scale frameshift alignment is available in LAST (http://last.cbrc.jp/falp). Contact: vog.hin.mln.ibcn@eguops or pj.crbc@nitram Supplementary information: Supplementary data are available at Bioinformatics online.
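The "classic BLAST statistics" that the abstract refers to are usually written as E = K·m·n·exp(−λS), where S is the raw alignment score and m and n are the query and database lengths. The Python sketch below only illustrates that standard formula. The function names and the λ and K values are illustrative assumptions, not parameters estimated by FALP or LAST for frameshift alignments.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score.

    Classic Karlin-Altschul form: E = K * m * n * exp(-lam * S).
    lam and k depend on the scoring scheme; values passed here are placeholders.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Hypothetical parameters, chosen only to exercise the formula.
LAM, K = 0.3, 0.1
print(e_value(raw_score=50, query_len=100, db_len=1000, lam=LAM, k=K))
```

Under these definitions the E-value can equivalently be computed from the bit score as m·n·2^(−S'), which is a handy consistency check.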

32 citations


Cited by
Journal Article
TL;DR: A new LCA-based algorithm for taxonomic binning, and an interval-tree based algorithm for functional binning that are explicitly designed for long reads and assembled contigs are described, and the applicability of widely-used metagenomic analysis software MEGAN is extended to long reads.
Abstract: There are numerous computational tools for taxonomic or functional analysis of microbiome samples, optimized to run on hundreds of millions of short, high quality sequencing reads. Programs such as MEGAN allow the user to interactively navigate these large datasets. Long read sequencing technologies continue to improve and produce increasing numbers of longer reads (of varying lengths in the range of 10k-1M bps, say), but of low quality. There is an increasing interest in using long reads in microbiome sequencing, and there is a need to adapt short read tools to long read datasets. We describe a new LCA-based algorithm for taxonomic binning, and an interval-tree based algorithm for functional binning, that are explicitly designed for long reads and assembled contigs. We provide a new interactive tool for investigating the alignment of long reads against reference sequences. For taxonomic and functional binning, we propose to use LAST to compare long reads against the NCBI-nr protein reference database so as to obtain frame-shift aware alignments, and then to process the results using our new methods. All presented methods are implemented in the open source edition of MEGAN, and we refer to this new extension as MEGAN-LR (MEGAN long read). We evaluate the LAST+MEGAN-LR approach in a simulation study, and on a number of mock community datasets consisting of Nanopore reads, PacBio reads and assembled PacBio reads. We also illustrate the practical application on a Nanopore dataset that we sequenced from an anammox bio-reactor community. This article was reviewed by Nicola Segata together with Moreno Zolfo, Pete James Lockhart and Serghei Mangul. This work extends the applicability of the widely-used metagenomic analysis software MEGAN to long reads. Our study suggests that the presented LAST+MEGAN-LR pipeline is sufficiently fast and accurate.
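To make the idea of LCA-based taxonomic binning concrete, here is a minimal Python sketch of the naive lowest-common-ancestor rule: a read that aligns to several taxa is assigned to the deepest node shared by all of their lineages. This is not the long-read algorithm implemented in MEGAN-LR. The `parent` map, the function names, and the toy taxonomy are hypothetical and heavily simplified.

```python
def lineage(taxon, parent):
    """Return the path from the root down to taxon, using a child -> parent map."""
    path = [taxon]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path[::-1]  # root first


def lca(taxa, parent):
    """Naive LCA: the deepest node that appears in every taxon's lineage."""
    lineages = [lineage(t, parent) for t in taxa]
    common = None
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            common = ranks[0]
        else:
            break
    return common


# Toy taxonomy (hypothetical, simplified): child -> parent.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Bacillus": "Firmicutes",
}

print(lca(["Escherichia", "Salmonella"], PARENT))
print(lca(["Escherichia", "Bacillus"], PARENT))
```

The more divergent the taxa hit by a read, the higher in the tree the read is placed, which is why the rule is conservative.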

124 citations

Journal Article
TL;DR: Although the compact nature of genes in phages is a problem for current gene annotators, PHANOTATE exploits this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible.
Abstract: Motivation: Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design, they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap and genes completely inside of other longer genes. This non-delineated genome structure makes it difficult for gene prediction using the currently available gene annotators. Here we present PHANOTATE, a novel method for gene calling specifically designed for phage genomes. Although the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use dynamic programming to find the optimal path. Results: We compare PHANOTATE to other gene callers by annotating a set of 2133 complete phage genomes from GenBank, using PHANOTATE and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with PHANOTATE predicting more genes than the other three. We searched for these extra genes in both GenBank's non-redundant protein database and all of the metagenomes in the sequence read archive, and found that they are present at levels that suggest that these are functional protein-coding genes. Availability and implementation: https://github.com/deprekate/PHANOTATE. Supplementary information: Supplementary data are available at Bioinformatics online.
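The "weighted graph plus dynamic programming" idea can be sketched in Python as a minimum-cost path over candidate open reading frames (ORFs), where a gene-like ORF lowers the cost and gaps or overlaps between consecutive ORFs raise it. This is a toy illustration under assumed weights, not PHANOTATE's actual scoring or path-finding code. The function name, the penalty values, and the ORF coordinates are all hypothetical.

```python
def best_gene_path(orfs, gap_penalty=0.1, overlap_penalty=0.5):
    """Pick a chain of ORFs with minimum total cost.

    orfs: list of (start, end, score) tuples; a higher score means more gene-like.
    Cost of a chain = -sum(scores) + penalties for gaps and overlaps between
    consecutive ORFs. Penalty values are placeholders.
    """
    orfs = sorted(orfs)          # sorting by start gives a topological order
    n = len(orfs)
    if n == 0:
        return []
    cost = [0.0] * n             # best cost of a chain ending at ORF j
    prev = [None] * n            # predecessor of ORF j on that chain
    for j, (sj, ej, wj) in enumerate(orfs):
        cost[j] = -wj            # chain consisting of ORF j alone
        for i in range(j):
            si, ei, wi = orfs[i]
            if ei >= ej:         # j ends inside i: do not chain them
                continue
            d = sj - ei          # positive = gap, negative = overlap
            link = gap_penalty * d if d >= 0 else overlap_penalty * (-d)
            candidate = cost[i] + link - wj
            if candidate < cost[j]:
                cost[j], prev[j] = candidate, i
    # Trace back from the cheapest chain end.
    j = min(range(n), key=cost.__getitem__)
    path = []
    while j is not None:
        path.append(orfs[j])
        j = prev[j]
    return path[::-1]


# Hypothetical candidate ORFs: (start, end, score).
CANDIDATES = [(0, 100, 10.0), (90, 200, 8.0), (95, 120, 1.0), (210, 300, 9.0)]
print(best_gene_path(CANDIDATES))
```

In this toy input the short, low-scoring ORF at 95..120 is dropped, while the small overlap between the first two ORFs is tolerated because both score well.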

124 citations

Journal Article
TL;DR: The Companion web server provides parasite genome annotation as a service using a reference-based approach. Its use and performance are demonstrated by annotating two Leishmania and Plasmodium genomes as typical parasite cases and comparing the results to manually annotated references.
Abstract: Currently available sequencing technologies enable quick and economical sequencing of many new eukaryotic parasite (apicomplexan or kinetoplastid) species or strains. Compared to SNP calling approaches, de novo assembly of these genomes enables researchers to additionally determine insertion, deletion and recombination events as well as to detect complex sequence diversity, such as that seen in variable multigene families. However, there currently are no automated eukaryotic annotation pipelines offering the required range of results to facilitate such analyses. A suitable pipeline needs to perform evidence-supported gene finding as well as functional annotation and pseudogene detection up to the generation of output ready to be submitted to a public database. Moreover, no current tool includes quick yet informative comparative analyses and a first pass visualization of both annotation and analysis results. To address these needs we have developed the Companion web server (http://companion.sanger.ac.uk) providing parasite genome annotation as a service using a reference-based approach. We demonstrate the use and performance of Companion by annotating two Leishmania and Plasmodium genomes as typical parasite cases and evaluate the results compared to manually annotated references.

101 citations

Journal Article
TL;DR: MetaMaps is a new method, specifically developed for long reads, capable of mapping a long-read metagenome to a comprehensive RefSeq database with >12,000 genomes in <16 GB of RAM on a laptop computer.
Abstract: Metagenomic sequence classification should be fast, accurate and information-rich. Emerging long-read sequencing technologies promise to improve the balance between these factors but most existing methods were designed for short reads. MetaMaps is a new method, specifically developed for long reads, capable of mapping a long-read metagenome to a comprehensive RefSeq database with >12,000 genomes in <16 GB of RAM on a laptop computer. It achieves >94% accuracy for species-level read assignment and r² > 0.97 for the estimation of sample composition on both simulated and real data when the sample genomes or close relatives are present in the classification database. To address novel species and genera, which are comparatively harder to predict, MetaMaps outputs mapping locations and qualities for all classified reads, enabling functional studies (e.g. gene presence/absence) and detection of incongruities between sample and reference genomes. Sequencing platforms such as Oxford Nanopore or Pacific Biosciences generate long-read data that preserve long-range genomic information but have high error rates. Here, the authors develop MetaMaps, a computational tool for strain-level metagenomic assignment and compositional estimation using long reads.

89 citations

Journal Article
TL;DR: It is shown that a regulatory pathway of the recipient R. solanacearum genome involved in extracellular infection of natural hosts was reused to improve intracellular symbiosis with the Mimosa pudica legume.
Abstract: JPC and CC were supported by the Initiative d’Excellence IDEX UNITI Actions Thematiques Strategiques program (RHIZOWHEAT 2014) and by the French National Research Agency (ANR-12-ADAP-0014-01). This work was supported by funds from the French National Institute for Agricultural Research (Plant Health and the Environment Division), the French National Research Agency (ANR-12-ADAP-0014-01) and the French Laboratory of Excellence project TULIP (ANR-10-LABX-41).

40 citations