scispace - formally typeset
Search or ask a question
Journal ArticleDOI

Minimap2: pairwise alignment for nucleotide sequences

Heng Li1
15 Sep 2018-Bioinformatics (Bioinformatics)-Vol. 34, Iss: 18, pp 3094-3100
TL;DR: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database and is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mapper at higher accuracy, surpassing most aligners specialized in one type of alignment.
Abstract: Motivation Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb in average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable or inefficient to process such data at scale, which presses for the development of new alignment algorithms. Results Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation https://github.com/lh3/minimap2. Supplementary information Supplementary data are available at Bioinformatics online.
Citations
More filters
Journal ArticleDOI
TL;DR: Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold.
Abstract: Although Kraken’s k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.

2,261 citations


Cites methods from "Minimap2: pairwise alignment for nu..."

  • ...A similar minimizer-based approach has proven useful in accelerating read alignment [16]....

    [...]

01 Jan 2011
TL;DR: The sheer volume and scope of data posed by this flood of data pose a significant challenge to the development of efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data.
Abstract: Rapid improvements in sequencing and array-based platforms are resulting in a flood of diverse genome-wide data, including data from exome and whole-genome sequencing, epigenetic surveys, expression profiling of coding and noncoding RNAs, single nucleotide polymorphism (SNP) and copy number profiling, and functional assays. Analysis of these large, diverse data sets holds the promise of a more comprehensive understanding of the genome and its relation to human disease. Experienced and knowledgeable human review is an essential component of this process, complementing computational approaches. This calls for efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data. However, the sheer volume and scope of data pose a significant challenge to the development of such tools.

2,187 citations

Journal ArticleDOI
14 May 2020-Cell
TL;DR: Functional investigation of the unknown transcripts and RNA modifications discovered in this study will open new directions to the understanding of the life cycle and pathogenicity of SARS-CoV-2.

1,626 citations


Cites methods from "Minimap2: pairwise alignment for nu..."

  • ...The sequence reads were aligned to the reference sequence database composed of the C. sabaeus genome (ENSEMBL release 99), a SARS-CoV-2 genome, yeast ENO2 cDNA (SGD: YHR174W), and human ribosomal DNA complete repeat unit (GenBank: U13369.1) using minimap2 2.17 (Li, 2018) with options ''-k 13 -x splice -N 32 -un.'' We used the sequence of the Wuhan-Hu-1 strain (GenBank: NC_045512.2) as a backbone for the viral reference genome, then corrected the four single nucleotide variants found in BetaCoV/Korea/KCDC03/2020; T4402C, G5062T, C8782T, and T28143C (GISAID: EPI_ISL_407193)....

    [...]

  • ...1) using minimap2 2.17 (Li, 2018) with options ‘‘-k 13 -x splice -...

    [...]

  • ...…and Algorithms guppy 3.4.5 Oxford Nanopore Technologies https://community.nanoporetech.com/ sso/login?next_url=%2Fdownloads minimap2 2.17 Li, 2018 https://github.com/lh3/minimap2 poreplex 0.5.0 Hyeshik Chang, Seoul National University,…...

    [...]

Journal ArticleDOI
03 Sep 2020-Cell
TL;DR: It is found that a substantial number of mutations to the RBD are well tolerated or even enhance ACE2 binding, including at ACE2 interface residues that vary across SARS-related coronaviruses.

1,517 citations


Cites background or methods from "Minimap2: pairwise alignment for nu..."

  • ...To do this, we used alignparse (Crawford and Bloom, 2019), version 0.1.3, which in turn makes use of minimap2 (Li, 2018), version 2.17....

    [...]

  • ...3, which in turn makes use of minimap2 (Li, 2018), version 2....

    [...]

  • ...…version 0.1.3 Crawford and Bloom, 2019 https://github.com/jbloomlab/alignparse minimap, version 2.17 Li 2018 https://github.com/lh3/minimap2 dms_variants, version 0.6.0 GitHub https://jbloomlab.github.io/dms_variants/ custom code This paper all…...

    [...]

Journal ArticleDOI
TL;DR: The current version of ONT’s Guppy basecaller performs well overall, with good accuracy and fast performance, and users should consider producing a custom model using a larger neural network and/or training data from the same species.
Abstract: Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here, we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rule consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences (‘polishing’) with Nanopolish somewhat negates the accuracy differences in basecallers, but pre-polish accuracy does have an effect on post-polish accuracy. Basecalling accuracy has seen significant improvements over the last 2 years. The current version of ONT’s Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.

1,488 citations


Cites methods from "Minimap2: pairwise alignment for nu..."

  • ...0 (the current version at the time of read selection), aligning the resulting reads (using minimap2 [18] v2....

    [...]

  • ...To assess read accuracy, we aligned each basecalled read set to the reference INF032 genome using minimap2 [18] (v2....

    [...]

References
More filters
Journal ArticleDOI
TL;DR: A novel approach to spliced alignment that meaningfully combined information from sequence similarity with that obtained from PSSM splice site models was taken and the resultant program, EXALIN, performed better than other popular tools tested under a wide range of conditions.
Abstract: Motivation: mRNA sequences and expressed sequence tags represent some of the most abundant experimental data for identifying genes and alternatively spliced products in metazoans. These transcript sequences are frequently studied by aligning them to a genomic sequence template. For existing programs, error-prone, polymorphic and cross-species data, as well as non-canonical splice sites, still present significant barriers to producing accurate, complete alignments. Results: We took a novel approach to spliced alignment that meaningfully combined information from sequence similarity with that obtained from PSSM splice site models. Scoring systems were chosen to maximize their power of discrimination, and dynamic programming (DP) was employed to guarantee optimal solutions would be found. The resultant program, EXALIN, performed better than other popular tools tested under a wide range of conditions that included detection of micro-exons and human--mouse cross-species comparisons. For improved speed with only a marginal decrease in splice site prediction accuracy, EXALIN could perform limited DP guided by a result from BLASTN. Availability: The source code, binaries, scripts, scoring matrices and splice site models for human, mouse, rice and Caenorhabditis elegans utilized in this study are posted at http://blast.wustl.edu/exalin. The software (scripts, source code and binaries) is copyrighted but free for all to use. Contact: gish@blast.wustl.edu Supplementary information: http://blast.wustl.edu/exalin/exalin-supplement.pdf

40 citations


"Minimap2: pairwise alignment for nu..." refers methods in this paper

  • ...(6) is almost equivalent to the equation used by EXALIN (Zhang and Gish, 2006) except that we allow insertions immediately followed by deletions and vice versa; in addition, we use Suzuki’s diagonal formulation in actual implementation....

    [...]

Journal ArticleDOI
TL;DR: A divide‐and‐conquer algorithm, called Kart, which can process long reads as fast as short reads by dividing a read into small fragments that can be aligned independently, and can tolerate much higher error rates.
Abstract: Motivation Next-generation sequencing (NGS) provides a great opportunity to investigate genome-wide variation at nucleotide resolution. Due to the huge amount of data, NGS applications require very fast and accurate alignment algorithms. Most existing algorithms for read mapping basically adopt seed-and-extend strategy, which is sequential in nature and takes much longer time on longer reads. Results We develop a divide-and-conquer algorithm, called Kart, which can process long reads as fast as short reads by dividing a read into small fragments that can be aligned independently. Our experiment result indicates that the average size of fragments requiring the more time-consuming gapped alignment is around 20 bp regardless of the original read length. Furthermore, it can tolerate much higher error rates. The experiments show that Kart spends much less time on longer reads than other aligners and still produce reliable alignments even when the error rate is as high as 15%. Availability and implementation Kart is available at https://github.com/hsinnan75/Kart/ . Contact hsu@iis.sinica.edu.tw. Supplementary information Supplementary data are available at Bioinformatics online.

39 citations

Journal ArticleDOI
TL;DR: Long approximate matches-based split aligner (LAMSA) as mentioned in this paper takes advantage of the rareness of structural variants to implement a specifically designed two-step strategy, which splits the read into relatively long fragments and co-linearly align them to solve the small variations or sequencing errors, and mitigate the effect of repeats.
Abstract: Motivation Read length is continuously increasing with the development of novel high-throughput sequencing technologies, which has enormous potentials on cutting-edge genomic studies. However, longer reads could more frequently span the breakpoints of structural variants (SVs) than that of shorter reads. This may greatly influence read alignment, since most state-of-the-art aligners are designed for handling relatively small variants in a co-linear alignment framework. Meanwhile, long read alignment is still not as efficient as that of short reads, which could be also a bottleneck for the upcoming wide application. Results We propose long approximate matches-based split aligner (LAMSA), a novel split read alignment approach. It takes the advantage of the rareness of SVs to implement a specifically designed two-step strategy. That is, LAMSA initially splits the read into relatively long fragments and co-linearly align them to solve the small variations or sequencing errors, and mitigate the effect of repeats. The alignments of the fragments are then used for implementing a sparse dynamic programming-based split alignment approach to handle the large or non-co-linear variants. We benchmarked LAMSA with simulated and real datasets having various read lengths and sequencing error rates, the results demonstrate that it is substantially faster than the state-of-the-art long read aligners; meanwhile, it also has good ability to handle various categories of SVs. Availability and implementation LAMSA is available at https://github.com/hitbc/LAMSA CONTACT: Ydwang@hit.edu.cnSupplementary information: Supplementary data are available at Bioinformatics online.

29 citations

Posted ContentDOI
22 Nov 2017-bioRxiv
TL;DR: A new benchmark dataset is derived from the de novo PacBio assemblies of two human cell lines that are homozygous across the whole genome that provides a more accurate and less biased estimate of the error rate of small variant calls in a realistic context.
Abstract: Constructed from the consensus of multiple variant callers based on short-read data, existing benchmark datasets for evaluating variant calling accuracy are biased toward easy regions accessible by known algorithms. We derived a new benchmark dataset from the de novo PacBio assemblies of two human cell lines that are homozygous across the whole genome. This benchmark provides a more accurate and less biased estimate of the error rate of small variant calls in a realistic context.

17 citations