Abstract: Exome sequencing of tumor samples and matched normal controls has the potential to rapidly identify protein-altering mutations across hundreds of patients, potentially enabling the discovery of recurrent events driving tumor development and growth (International Cancer Genome Consortium 2010; Stratton 2011). Yet the analysis of such data presents significant challenges. Sequencing coverage is nonuniform across targeted regions and from one sample to the next (Ng et al. 2009; Bainbridge et al. 2010; Teer et al. 2010). Many regions achieve high read depth (more than 100×), which can confound variant callers and depth-based filters if not properly addressed (Ku et al. 2011). Repetitive and paralogous sequences can give rise to numerous false positives. The detection of somatic mutations in tumor genomes is even more challenging. The genomes of primary tumors are genetically heterogeneous (Ding et al. 2010), with frequent rearrangements (Campbell et al. 2008) and copy number alterations (CNAs) (Beroukhim et al. 2010). Further, somatic mutations are relatively rare compared with germline variation, often representing <0.1% of variants in a tumor genome (Ley et al. 2008; Mardis et al. 2009). Simply subtracting variants in the matched normal from variants in the tumor (Wei et al. 2011) is poorly suited for the analysis of exome sequence data, because it fails to account for regions that were undersampled in the normal. Accurate mutation detection requires a direct, simultaneous comparison of tumor–normal pairs at every position in the exome, but few algorithms to do so have been described.
Numerous algorithms have been developed to assess genome-wide copy number using whole-genome sequencing (WGS) data. Most of these approaches (Campbell et al. 2008; Alkan et al. 2009; Chiang et al. 2009; Yoon et al. 2009; Abyzov et al. 2011) would be confounded by exome data sets, because of the biases introduced by hybridization and the sparse and uneven coverages throughout the genome. However, when both DNA samples in a tumor–normal pair were captured and sequenced under identical hybridization conditions, we reasoned that it might be possible to detect somatic CNAs (SCNAs) as deviations from the log-ratio of sequence coverage depth within a tumor–normal pair, and then quantify the deviations statistically. Such an approach would provide a gene-centric view of copy number in a tumor sample, though it would be limited to the ∼1% of the genome captured by current exome platforms.
Previously, we published VarScan (Koboldt et al. 2009), an algorithm for variant detection in next-generation sequencing data. We have since released a new tool, VarScan 2 (http://varscan.sourceforge.net), with several improvements, including the ability to identify somatic mutation, loss of heterozygosity (LOH), and CNA events in tumor–normal pairs. VarScan 2 analyzes sequence data from a tumor sample and its corresponding normal sample simultaneously, applying heuristic methods and a statistical test to detect variants—single nucleotide variants (SNVs) and insertions/deletions (indels)—and classify them by somatic status. By direct comparison of normalized sequence depth, our method also detects SCNAs in the tumor genome.
Here, we utilize VarScan 2 for the analysis of exome sequence data from 151 patients with high-grade serous ovarian adenocarcinoma (HGS-OVCa) that were initially characterized within the Cancer Genome Atlas (TCGA) project (Cancer Genome Atlas Research Network 2011). We present a robust pipeline for the detection of both germline (inherited) and somatic (acquired) mutations by exome sequencing and describe filtering approaches for detecting variants with high sensitivity and specificity. To evaluate the performance of our SCNA detection algorithm, we compare our results to copy number data from high-density SNP array and WGS approaches. Our results demonstrate the accuracy of VarScan 2 for somatic mutation and CNA detection and enable a new survey of the genetic landscape in ovarian carcinoma.
... read more