An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data

TLDR
GotCloud is a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. It automates sequence alignment, sample-level quality control, variant calling, machine-learning-based filtering of likely artifacts, and haplotype-based genotype refinement.
Abstract
The analysis of next-generation sequencing data is computationally and statistically challenging because of the massive volume of data and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. GotCloud automates sequence alignment, sample-level quality control, variant calling, filtering of likely artifacts using machine-learning techniques, and genotype refinement using haplotype information. The pipeline can process thousands of samples in parallel and requires fewer computational resources than current alternatives. Experiments with whole-genome and exome-targeted sequence data generated by the 1000 Genomes Project show that the pipeline provides effective filtering against false-positive variants and high power to detect true variants. Our pipeline has already contributed to variant detection and genotyping in several large-scale sequencing projects, including the 1000 Genomes Project and the NHLBI Exome Sequencing Project. We hope it will now prove useful to many medical sequencing studies.
