
False Negatives Are a Significant Feature of Next Generation Sequencing Callsets

TLDR
It is shown that missing mutations are a significant feature of genomic datasets, implying that additional fine-tuning of bioinformatics pipelines is needed; a phylogeny-aware tool is presented that can quantify the FN rate for haploid genomic experiments without additional generation of validation data.

UC Davis Previously Published Works

Title: False Negatives Are a Significant Feature of Next Generation Sequencing Callsets
Permalink: https://escholarship.org/uc/item/0k20n6hq
Authors: Bobo, Dean; Lipatov, Mikhail; Rodriguez-Flores, Juan; et al.
Publication Date: 2016
DOI: 10.1101/066043
Peer reviewed

Title: False Negatives Are a Significant Feature of Next Generation Sequencing Callsets

Authors: Dean Bobo¹, Mikhail Lipatov¹, Juan L. Rodriguez-Flores², Adam Auton³ and Brenna M. Henn¹,§

¹ Department of Ecology and Evolution, Stony Brook University, Stony Brook, NY, 11794, USA.
² Department of Genetic Medicine, Weill Cornell Medical College, New York, NY, 10021, USA.
³ Department of Genetics, Albert Einstein College of Medicine, Bronx, NY, 10461, USA.
⁴ Graduate Program in Genetics, Stony Brook University, Stony Brook, NY, 11794, USA.

§ Correspondence should be addressed to: Brenna Henn, Dept. of Ecology and Evolution, Life Sciences Bldg., Room 640, Stony Brook NY 11794. Phone: 631-632-1412. E-mail: brenna.henn@stonybrook.edu
Key Words: sequencing error, mutation rate, de novo mutations, next-generation
sequencing
Data deposition: Data and software are freely available on the Henn Lab website:
https://ecoevo.stonybrook.edu/hennlab/data-software/
Software: GITHUB via https://ecoevo.stonybrook.edu/hennlab/data-software/
bioRxiv preprint doi: https://doi.org/10.1101/066043; this version posted October 18, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract:
Short-read, next-generation sequencing (NGS) is now broadly used to identify rare or de
novo mutations in population samples and disease cohorts. However, NGS data is known
to be error-prone and post-processing pipelines have primarily focused on the removal of
spurious mutations or "false positives" for downstream genome datasets. Less attention
has been paid to characterizing the fraction of missing mutations or "false negatives"
(FN). Here we interrogate several publically available human NGS autosomal variant
datasets using corresponding Sanger sequencing as a truth-set. We examine both low-
coverage Illumina and high-coverage Complete Genomics genomes. We show that the
FN rate varies between 3%-18% and that false-positive rates are considerably lower
(<3%) for publically available human genome callsets like 1000 Genomes. The FN rate is
strongly dependent on calling pipeline parameters, as well as read coverage. Our results
demonstrate that missing mutations are a significant feature of genomic datasets and
imply additional fine-tuning of bioinformatics pipelines is needed. To address this, we
design a phylogeny-aware tool [PhyloFaN] which can be used to quantify the FN rate for
haploid genomic experiments, without additional generation of validation data. Using
PhyloFaN on ultra-high coverage NGS data from both Illumina HiSeq and Complete
Genomics platforms derived from the 1000 Genomes Project, we characterize the false
negative rate in human mtDNA genomes. The false negative rate for the publically
available mtDNA callsets is 17-20%, even for extremely high coverage haploid data.
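The error categories in the abstract can be made concrete with a small sketch: given a Sanger-derived truth set and an NGS callset for the same sample, the FN rate is the fraction of truth-set variants absent from the callset, and the FP rate is the fraction of called variants absent from the truth set. The site identifiers below are hypothetical and are not the paper's data.

```python
# Hypothetical example: compute false negative (FN) and false positive (FP)
# rates by comparing an NGS callset against a Sanger-derived truth set.
truth_set = {"chr1:1045", "chr1:2210", "chr2:880", "chr2:991", "chr3:157"}
ngs_callset = {"chr1:1045", "chr2:880", "chr2:991", "chr3:157", "chr4:42"}

false_negatives = truth_set - ngs_callset    # true variants the pipeline missed
false_positives = ngs_callset - truth_set    # called variants absent from the truth set

fn_rate = len(false_negatives) / len(truth_set)
fp_rate = len(false_positives) / len(ngs_callset)

print(f"FN rate: {fn_rate:.0%}")  # 1 of 5 truth variants missed -> 20%
print(f"FP rate: {fp_rate:.0%}")  # 1 of 5 calls spurious -> 20%
```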

Introduction
Mutation is the process by which novel genetic variation is generated; thus, the
accurate identification of mutations in genomic data is of the utmost importance for
mapping Mendelian disease, population genetic analysis, tumor sequencing, and rare
variant phenotype/genotype associations (Shendure and Akey 2015). Multiple
bioinformatic algorithms have been developed to call mutations from short read, next-
generation sequencing (NGS) data (DePristo et al. 2011; Ramu et al. 2013; Pabinger et al.
2014). However, there is a growing consensus that both short- and long-read NGS
associated calling methods generate datasets with appreciably high error rates,
particularly for rare or de novo mutations (Wall et al. 2014; Ségurel et al. 2014; O’Rawe
et al. 2015). These technical error profiles affect many forms of human genomic data, and
are particularly crucial for the identification of de novo mutations in disease phenotypes
(Kong et al. 2012; Ng et al. 2010; Bamshad et al. 2011) and somatic tissue (Tomasetti et
al. 2013; Costa et al. 2015). Raw 2nd generation sequencing read data contains a great
number of false positive variants (i.e. referred to as "sequencing error")
(Robasky et al. 2013; Reumers et al. 2011). Accordingly, pre- and post-processing
pipelines filter the raw data in order to discard false positive variants. However, such
pipelines may also remove true variants, which will then result in a relatively high false
negative rate in the variant callset.
Recent efforts to quantify NGS error rates have primarily been focused on the
identification of false positive errors in human NGS data (Zook et al. 2014; Kennedy et
al. 2014). However, the need for the quantification of false negatives in such data has

received far less attention (Brandt et al. 2015; Pabinger et al. 2014). High error rates
complicate disease studies which search for de novo disease mutations between parents
and probands with exome or genome sequencing. There is often a high number of
candidate de novo mutations identified in trio/duo designs, but most candidates are a result of
either a false positive in the offspring or a false negative in a parent
(Girard et al. 2011; Veeramah et al. 2013; Vissers et al. 2010). For example, Vissers et
al. (Vissers et al. 2010) identify 51 candidate de novo mutations in ten probands with
mental retardation, but were only able to validate 13 with Sanger sequencing. Sanger
validation of the parents revealed that only 9 of these were truly de novo; the remaining 4
were likely false negatives in the parents (i.e. 30% false negative rate). Other studies
identify similarly high false negative rates (Michaelson et al. 2012), but the precise ratio
in a given study will depend on many factors. For example, in the context of trio
pedigree-based calling, filtering for mutations which are already present in a large SNP
repository, such as dbSNP, will mean that recurrent de novo mutations are eliminated
from the final callset; recent work with the EXaC database specifically highlights this
problem (Lek et al. 2016). Recently, Chen et al. (Chen et al. 2016) report that damage
introduced in vitro during NGS library preparation results in a high number of spurious
variants, and estimate that this damage causes the majority of G to T transversions in
73% of large, publically available datasets (i.e. 1000G and the Cancer Genome Atlas
[TCGA]). A balanced assessment of both false positive and false negative error rates is
necessary for Mendelian and complex disease identification approaches, but also crucial
for evolutionary studies of mutation rates (Ségurel et al. 2014).
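The Vissers et al. arithmetic summarized above can be checked directly; the figures below are exactly those quoted in the text.

```python
# Numbers from Vissers et al. (2010) as summarized in the text.
candidates = 51          # candidate de novo mutations across ten probands
sanger_validated = 13    # candidates confirmed as real variants in the offspring
truly_de_novo = 9        # validated variants absent from both parents

# A validated variant that is not truly de novo was present in a parent
# but missed by the parent's callset, i.e. a parental false negative.
parental_fn = sanger_validated - truly_de_novo

fn_rate = parental_fn / sanger_validated
print(f"{parental_fn} of {sanger_validated} validated variants were "
      f"parental false negatives ({fn_rate:.0%})")  # ~31%, quoted as ~30%
```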

Citations
More filters
Journal ArticleDOI

The presence and impact of reference bias on population genomic studies of prehistoric human populations.

TL;DR: It is illustrated that the strength of reference bias is negatively correlated with fragment length, which has the potential to cause minor but significant differences in the results of downstream analyses such as population allele sharing, heterozygosity estimates and estimates of archaic ancestry.
Journal ArticleDOI

Ultrarare variants drive substantial cis heritability of human gene expression.

TL;DR: An approach to estimate the contribution of all alleles to phenotypic variation is applied to transcription regulation using whole-genome sequencing and transcriptome data and an inference procedure is developed to demonstrate that the results are consistent with pervasive purifying selection shaping the regulatory architecture of most human genes.
Journal ArticleDOI

No Evidence for Recent Selection at FOXP2 among Diverse Human Populations

TL;DR: A substantial revision to the adaptive history of FOXP2, a gene regarded as vital to human evolution, is presented, finding an intronic region that is enriched for highly conserved sites that are polymorphic among humans, compatible with a loss of function in humans.
Posted ContentDOI

The presence and impact of reference bias on population genomic studies of prehistoric human populations

TL;DR: It is illustrated that the strength of reference bias is negatively correlated with fragment length, which can cause differences in the results of downstream analyses such as population affinities, heterozygosity estimates and estimates of archaic ancestry.
Journal ArticleDOI

Advances and Trends in Omics Technology Development

Xiaofeng Dai, +1 more
TL;DR: Redoxomics is predicted as an emerging omics layer that views cell decision toward the physiological or pathological state as a fine-tuned redox balance and delineates hierarchies of these omics together with their epiomics and interactomics.
References
More filters
Journal ArticleDOI

Fast and accurate short read alignment with Burrows–Wheeler transform

TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Journal ArticleDOI

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Journal ArticleDOI

A global reference for human genetic variation.

Adam Auton, +517 more
- 01 Oct 2015 - 
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Journal ArticleDOI

Sequence and organization of the human mitochondrial genome

TL;DR: The complete sequence of the 16,569-base pair human mitochondrial genome is presented and shows extreme economy in that the genes have none or only a few noncoding bases between them, and in many cases the termination codons are not coded in the DNA but are created post-transcriptionally by polyadenylation of the mRNAs.
Frequently Asked Questions (12)
Q1. What are the contributions mentioned in the paper "False negatives are a significant feature of next generation sequencing callsets" ?

In this paper, the authors quantify the false negative rate in short read, next-generation sequencing ( NGS ) data. 

Such pipelines tend to optimize filtering out false positive variants, which are highly prevalent in raw 2nd generation sequencing data (DePristo et al. 2011). 

In the absence of recombination, any given contiguous sequence of nucleotides can be modeled as being inherited identically by descent (IBD) by creating a phylogenetic tree of shared and derived mutations. 
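The phylogeny-aware idea described above can be sketched as follows: variants that define a sample's lineage in the tree are expected (inherited IBD) in its callset, so any expected-but-absent variant is a predicted false negative. The lineage-defining variants and calls below are hypothetical and do not reproduce PhyloFaN's actual implementation.

```python
# Sketch of phylogeny-based FN prediction for a haploid (e.g. mtDNA) sample.
# Variants defining the sample's lineage are expected in its callset;
# expected-but-absent variants are predicted false negatives.
lineage_defining = {"A73G", "C150T", "G263A", "T489C", "A750G"}  # hypothetical
observed_calls = {"A73G", "G263A", "A750G"}                      # hypothetical

predicted_fn = lineage_defining - observed_calls
fn_rate = len(predicted_fn) / len(lineage_defining)

print(sorted(predicted_fn), f"FN rate: {fn_rate:.0%}")  # 2 of 5 missing -> 40%
```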

To compute the depth of coverage for each base pair location in each sample in their Illumina data, the authors used GATK’s DepthOfCoverage (McKenna et al. 2010). 

For 9 of the remaining candidate mutations, the variants in the mother’s sequence were predicted to be present based on the mother’s phylogenetic lineage, so the corresponding candidate mutations were excluded. 

While PhyloFaN can be used to systematically explore the effect of pipeline parameters on the false negative in haploid systems, it is an imperfect proxy for assaying autosomal data. 

In the Complete Genomics dataset, their algorithm estimates that 2,313 out of 11,429 predicted variants were missing from the NGS variant callset. 
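Those Complete Genomics figures imply a false negative rate at the top of the 17-20% range quoted in the abstract:

```python
# Complete Genomics figures quoted above: predicted variants missing
# from the published callset.
missing = 2313
predicted = 11429

fn_rate = missing / predicted
print(f"FN rate: {fn_rate:.1%}")  # 20.2%
```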

A balanced assessment of both false positive and false negative error rates is necessary for Mendelian and complex disease identification approaches, but also crucial for evolutionary studies of mutation rates (Ségurel et al. 2014). 

There is often a high number of candidate de novo mutations identified in trio/duo designs, but most candidates are a result of either a false positive in the offspring or a false negative in a parent (Girard et al. 2011; Veeramah et al. 2013; Vissers et al. 2010). 

For consistency, the authors excluded indels from this analysis so the autosomal and mitochondrial false negative rates could be compared. 

A logit model with these parameters predicts that an increase in coverage from 2,000 to 3,000 reads leads to a decrease in the probability of false negative status from 17.3% to 15.8%. 
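The two (coverage, FN probability) points quoted above pin down an illustrative logit fit. The paper's actual model was fit to the full per-site data, so the intercept and slope recovered here are only a sketch consistent with the quoted numbers.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

def sigmoid(x):
    """Inverse of logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

# The two (coverage, FN probability) points quoted in the text.
x1, p1 = 2000, 0.173
x2, p2 = 3000, 0.158

# Solve for the line logit(p) = b0 + b1 * coverage through both points.
b1 = (logit(p2) - logit(p1)) / (x2 - x1)  # slope per read of coverage (negative)
b0 = logit(p1) - b1 * x1

# The fitted line reproduces both quoted probabilities:
print(round(sigmoid(b0 + b1 * 2000), 3))  # 0.173
print(round(sigmoid(b0 + b1 * 3000), 3))  # 0.158
```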

During preparation of the callset, it was assumed that for any given locus the mtDNA has only one allele in a particular individual and heterozygous sites were removed.