
False Negatives Are a Significant Feature of Next Generation Sequencing Callsets

TLDR
It is shown that missing mutations are a significant feature of genomic datasets, implying that additional fine-tuning of bioinformatics pipelines is needed; a phylogeny-aware tool is presented that can quantify the FN rate for haploid genomic experiments without additional generation of validation data.

UC Davis Previously Published Works

Title: False Negatives Are a Significant Feature of Next Generation Sequencing Callsets
Permalink: https://escholarship.org/uc/item/0k20n6hq
Authors: Bobo, Dean; Lipatov, Mikhail; Rodriguez-Flores, Juan; et al.
Publication Date: 2016
DOI: 10.1101/066043
Peer reviewed

Title: False Negatives Are a Significant Feature of Next Generation Sequencing Callsets

Authors: Dean Bobo¹, Mikhail Lipatov¹, Juan L. Rodriguez-Flores², Adam Auton³ and Brenna M. Henn¹,§

¹ Department of Ecology and Evolution, Stony Brook University, Stony Brook, NY, 11794, USA.
² Department of Genetic Medicine, Weill Cornell Medical College, New York, NY, 10021, USA.
³ Department of Genetics, Albert Einstein College of Medicine, Bronx, NY, 10461, USA.
⁴ Graduate Program in Genetics, Stony Brook University, Stony Brook, NY, 11794, USA.

§ Correspondence should be addressed to: Brenna Henn, Dept. of Ecology and Evolution, Life Sciences Bldg., Room 640, Stony Brook NY 11794. Phone: 631-632-1412. E-mail: brenna.henn@stonybrook.edu
Key Words: sequencing error, mutation rate, de novo mutations, next-generation
sequencing
Data deposition: Data and software are freely available on the Henn Lab website:
https://ecoevo.stonybrook.edu/hennlab/data-software/
Software: GITHUB via https://ecoevo.stonybrook.edu/hennlab/data-software/
bioRxiv preprint doi: https://doi.org/10.1101/066043; this version posted October 18, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract:
Short-read, next-generation sequencing (NGS) is now broadly used to identify rare or de
novo mutations in population samples and disease cohorts. However, NGS data is known
to be error-prone and post-processing pipelines have primarily focused on the removal of
spurious mutations or "false positives" for downstream genome datasets. Less attention
has been paid to characterizing the fraction of missing mutations or "false negatives"
(FN). Here we interrogate several publically available human NGS autosomal variant
datasets using corresponding Sanger sequencing as a truth-set. We examine both low-
coverage Illumina and high-coverage Complete Genomics genomes. We show that the
FN rate varies between 3%-18% and that false-positive rates are considerably lower
(<3%) for publically available human genome callsets like 1000 Genomes. The FN rate is
strongly dependent on calling pipeline parameters, as well as read coverage. Our results
demonstrate that missing mutations are a significant feature of genomic datasets and
imply additional fine-tuning of bioinformatics pipelines is needed. To address this, we
design a phylogeny-aware tool [PhyloFaN] which can be used to quantify the FN rate for
haploid genomic experiments, without additional generation of validation data. Using
PhyloFaN on ultra-high coverage NGS data from both Illumina HiSeq and Complete
Genomics platforms derived from the 1000 Genomes Project, we characterize the false
negative rate in human mtDNA genomes. The false negative rate for the publically
available mtDNA callsets is 17-20%, even for extremely high coverage haploid data.
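The error categories in the abstract can be made concrete with a small sketch: given a Sanger-derived truth set and an NGS callset for the same sample, the FN rate is the fraction of truth-set variants absent from the callset, and the FP rate is the fraction of called variants absent from the truth set. The site identifiers below are hypothetical and are not the paper's data.

```python
# Hypothetical example: compute false negative (FN) and false positive (FP)
# rates by comparing an NGS callset against a Sanger-derived truth set.
truth_set = {"chr1:1045", "chr1:2210", "chr2:880", "chr2:991", "chr3:157"}
ngs_callset = {"chr1:1045", "chr2:880", "chr2:991", "chr3:157", "chr4:42"}

false_negatives = truth_set - ngs_callset    # true variants the pipeline missed
false_positives = ngs_callset - truth_set    # called variants absent from the truth set

fn_rate = len(false_negatives) / len(truth_set)
fp_rate = len(false_positives) / len(ngs_callset)

print(f"FN rate: {fn_rate:.0%}")  # 1 of 5 truth variants missed -> 20%
print(f"FP rate: {fp_rate:.0%}")  # 1 of 5 calls spurious -> 20%
```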

Introduction
Mutation is the process by which novel genetic variation is generated; thus, the
accurate identification of mutations in genomic data is of the utmost importance for
mapping Mendelian disease, population genetic analysis, tumor sequencing, and rare
variant phenotype/genotype associations (Shendure and Akey 2015). Multiple
bioinformatic algorithms have been developed to call mutations from short read, next-
generation sequencing (NGS) data (DePristo et al. 2011; Ramu et al. 2013; Pabinger et al.
2014). However, there is a growing consensus that both short- and long-read NGS
associated calling methods generate datasets with appreciably high error rates,
particularly for rare or de novo mutations (Wall et al. 2014; Ségurel et al. 2014; O’Rawe
et al. 2015). These technical error profiles affect many forms of human genomic data, and
are particularly crucial for the identification of de novo mutations in disease phenotypes
(Kong et al. 2012; Ng et al. 2010; Bamshad et al. 2011) and somatic tissue (Tomasetti et
al. 2013; Costa et al. 2015). Raw 2nd generation sequencing read data contains a great
number of false positive variants (i.e. referred to as "sequencing error")
(Robasky et al. 2013; Reumers et al. 2011). Accordingly, pre- and post-processing
pipelines filter the raw data in order to discard false positive variants. However, such
pipelines may also remove true variants, which will then result in a relatively high false
negative rate in the variant callset.
Recent efforts to quantify NGS error rates have primarily been focused on the
identification of false positive errors in human NGS data (Zook et al. 2014; Kennedy et
al. 2014). However, the need for the quantification of false negatives in such data has

received far less attention (Brandt et al. 2015; Pabinger et al. 2014). High error rates
complicate disease studies which search for de novo disease mutations between parents
and probands with exome or genome sequencing. There is often a high number of
candidate de novo mutations identified in trio/duo designs, but most candidates are a result of
either a false positive in the offspring or a false negative in a parent
(Girard et al. 2011; Veeramah et al. 2013; Vissers et al. 2010). For example, Vissers et
al. (Vissers et al. 2010) identify 51 candidate de novo mutations in ten probands with
mental retardation, but were only able to validate 13 with Sanger sequencing. Sanger
validation of the parents revealed that only 9 of these were truly de novo; the remaining 4
were likely false negatives in the parents (i.e. 30% false negative rate). Other studies
identify similarly high false negative rates (Michaelson et al. 2012), but the precise ratio
in a given study will depend on many factors. For example, in the context of trio
pedigree-based calling, filtering for mutations which are already present in a large SNP
repository, such as dbSNP, will mean that recurrent de novo mutations are eliminated
from the final callset; recent work with the EXaC database specifically highlights this
problem (Lek et al. 2016). Recently, Chen et al. (Chen et al. 2016) report that damage
introduced in vitro during NGS library preparation results in a high number of spurious
variants, and estimate that this damage causes the majority of G to T transversions in
73% of large, publically available datasets (i.e. 1000G and the Cancer Genome Atlas
[TCGA]). A balanced assessment of both false positive and false negative error rates is
necessary for Mendelian and complex disease identification approaches, but also crucial
for evolutionary studies of mutation rates (Ségurel et al. 2014).
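The Vissers et al. arithmetic summarized above can be checked directly; the figures below are exactly those quoted in the text.

```python
# Numbers from Vissers et al. (2010) as summarized in the text.
candidates = 51          # candidate de novo mutations across ten probands
sanger_validated = 13    # candidates confirmed as real variants in the offspring
truly_de_novo = 9        # validated variants absent from both parents

# A validated variant that is not truly de novo was present in a parent
# but missed by the parent's callset, i.e. a parental false negative.
parental_fn = sanger_validated - truly_de_novo

fn_rate = parental_fn / sanger_validated
print(f"{parental_fn} of {sanger_validated} validated variants were "
      f"parental false negatives ({fn_rate:.0%})")  # ~31%, quoted as ~30%
```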

Citations
More filters
Journal ArticleDOI

The presence and impact of reference bias on population genomic studies of prehistoric human populations.

TL;DR: It is illustrated that the strength of reference bias is negatively correlated with fragment length, which has the potential to cause minor but significant differences in the results of downstream analyses such as population allele sharing, heterozygosity estimates and estimates of archaic ancestry.
Journal ArticleDOI

Ultrarare variants drive substantial cis heritability of human gene expression.

TL;DR: An approach to estimate the contribution of all alleles to phenotypic variation is applied to transcription regulation using whole-genome sequencing and transcriptome data and an inference procedure is developed to demonstrate that the results are consistent with pervasive purifying selection shaping the regulatory architecture of most human genes.
Journal ArticleDOI

No Evidence for Recent Selection at FOXP2 among Diverse Human Populations

TL;DR: A substantial revision to the adaptive history of FOXP2, a gene regarded as vital to human evolution, is presented, finding an intronic region that is enriched for highly conserved sites that are polymorphic among humans, compatible with a loss of function in humans.
Posted ContentDOI

The presence and impact of reference bias on population genomic studies of prehistoric human populations

TL;DR: It is illustrated that the strength of reference bias is negatively correlated with fragment length, which can cause differences in the results of downstream analyses such as population affinities, heterozygosity estimates and estimates of archaic ancestry.
Journal ArticleDOI

Advances and Trends in Omics Technology Development

Xiaofeng Dai, +1 more
TL;DR: Redoxomics is predicted as an emerging omics layer that views cell decision toward the physiological or pathological state as a fine-tuned redox balance and delineates hierarchies of these omics together with their epiomics and interactomics.
References
More filters
Journal ArticleDOI

Fast and accurate short read alignment with Burrows–Wheeler transform

TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Journal ArticleDOI

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Journal ArticleDOI

A global reference for human genetic variation.

Adam Auton, +517 more
- 01 Oct 2015 - 
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Journal ArticleDOI

Sequence and organization of the human mitochondrial genome

TL;DR: The complete sequence of the 16,569-base pair human mitochondrial genome is presented and shows extreme economy in that the genes have none or only a few noncoding bases between them, and in many cases the termination codons are not coded in the DNA but are created post-transcriptionally by polyadenylation of the mRNAs.
Frequently Asked Questions (12)
Q1. What are the contributions mentioned in the paper "False negatives are a significant feature of next generation sequencing callsets" ?

In this paper, the authors quantify the false negative rate in short read, next-generation sequencing ( NGS ) data. 

Such pipelines tend to optimize filtering out false positive variants, which are highly prevalent in raw 2nd generation sequencing data (DePristo et al. 2011). 

In the absence of recombination, any given contiguous sequence of nucleotides can be modeled as being inherited identically by descent (IBD) by creating a phylogenetic tree of shared and derived mutations. 
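The phylogeny-aware idea described above can be sketched as follows: variants that define a sample's lineage in the tree are expected (inherited IBD) in its callset, so any expected-but-absent variant is a predicted false negative. The lineage-defining variants and calls below are hypothetical and do not reproduce PhyloFaN's actual implementation.

```python
# Sketch of phylogeny-based FN prediction for a haploid (e.g. mtDNA) sample.
# Variants defining the sample's lineage are expected in its callset;
# expected-but-absent variants are predicted false negatives.
lineage_defining = {"A73G", "C150T", "G263A", "T489C", "A750G"}  # hypothetical
observed_calls = {"A73G", "G263A", "A750G"}                      # hypothetical

predicted_fn = lineage_defining - observed_calls
fn_rate = len(predicted_fn) / len(lineage_defining)

print(sorted(predicted_fn), f"FN rate: {fn_rate:.0%}")  # 2 of 5 missing -> 40%
```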

To compute the depth of coverage for each base pair location in each sample in their Illumina data, the authors used GATK’s DepthOfCoverage (McKenna et al. 2010). 

For 9 of the remaining candidate mutations, the variants in the mother’s sequence were predicted to be present based on the mother’s phylogenetic lineage, so the corresponding candidate mutations were excluded. 

While PhyloFaN can be used to systematically explore the effect of pipeline parameters on the false negative in haploid systems, it is an imperfect proxy for assaying autosomal data. 

In the Complete Genomics dataset, their algorithm estimates that 2,313 out of 11,429 predicted variants were missing from the NGS variant callset. 
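Those Complete Genomics figures imply a false negative rate at the top of the 17-20% range quoted in the abstract:

```python
# Complete Genomics figures quoted above: predicted variants missing
# from the published callset.
missing = 2313
predicted = 11429

fn_rate = missing / predicted
print(f"FN rate: {fn_rate:.1%}")  # 20.2%
```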

A balanced assessment of both false positive and false negative error rates is necessary for Mendelian and complex disease identification approaches, but also crucial for evolutionary studies of mutation rates (Ségurel et al. 2014). 

There is often a high number of candidate de novo mutations identified in trio/duo designs, but most candidates are a result of either a false positive in the offspring or a false negative in a parent (Girard et al. 2011; Veeramah et al. 2013; Vissers et al. 2010). 

For consistency, the authors excluded indels from this analysis so the autosomal and mitochondrial false negative rates could be compared. 

A logit model with these parameters predicts that an increase in coverage from 2,000 to 3,000 reads leads to a decrease in the probability of false negative status from 17.3% to 15.8%. 
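The two (coverage, FN probability) points quoted above pin down an illustrative logit fit. The paper's actual model was fit to the full per-site data, so the intercept and slope recovered here are only a sketch consistent with the quoted numbers.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

def sigmoid(x):
    """Inverse of logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

# The two (coverage, FN probability) points quoted in the text.
x1, p1 = 2000, 0.173
x2, p2 = 3000, 0.158

# Solve for the line logit(p) = b0 + b1 * coverage through both points.
b1 = (logit(p2) - logit(p1)) / (x2 - x1)  # slope per read of coverage (negative)
b0 = logit(p1) - b1 * x1

# The fitted line reproduces both quoted probabilities:
print(round(sigmoid(b0 + b1 * 2000), 3))  # 0.173
print(round(sigmoid(b0 + b1 * 3000), 3))  # 0.158
```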

During preparation of the callset, it was assumed that for any given locus the mtDNA has only one allele in a particular individual and heterozygous sites were removed.