Whole-Genome Shotgun Sequence CNV Detection Using Read Depth.

Fatma Kahveci and Can Alkan
01 Jan 2018, Vol. 1833, pp 61-72
Abstract
With the developments in high-throughput sequencing (HTS) technologies, researchers have gained a powerful tool to identify structural variants (SVs) in genomes at substantially lower cost than before. SVs can be broadly classified into two main categories: balanced rearrangements and copy number variations (CNVs). Many algorithms have been developed to characterize CNVs using HTS data, each focusing on different types and size ranges of variants and using different read signatures. Read depth (RD) based tools are more commonly used to characterize large (>10 kb) CNVs, since the RD strategy does not rely on fragment size and read length, which are limiting factors in read pair and split read analysis. Here we provide a guideline for a user-friendly tool for detecting large segmental duplications and deletions that can also predict integer copy numbers for duplicated genes.

Derek M. Bickhart (ed.), Copy Number Variants: Methods and Protocols, Methods in Molecular Biology, vol. 1833,
https://doi.org/10.1007/978-1-4939-8666-8_4, © Springer Science+Business Media, LLC, part of Springer Nature 2018
Chapter 4
Whole-Genome Shotgun Sequence CNV Detection
Using Read Depth
Fatma Kahveci and Can Alkan
Key words: Copy number variation, Whole genome shotgun sequencing, Read depth, mrFAST, mrsFAST
Abbreviations
CNV         Copy number variation
mrCaNaVaR   Micro read Copy Number Variant Regions
mrFAST      Micro read Fast Alignment Search Tool
mrsFAST     Micro read substitution only Fast Alignment Search Tool
RD          Read depth
TRF         Tandem repeat finder
WGS         Whole genome sequencing
1 Introduction
CNVs are changes in the amount of DNA in a genome that manifest as duplication (gain) and deletion (loss) events [1]. Traditionally, a copy number variant is defined as a gain or loss of DNA in the genome that is larger than 1 Kb, although smaller deletions or duplications (>50 bp) are considered CNVs in several projects such as the 1000 Genomes Project. CNVs are known to have played a major role in evolution [2–4], and several of them are correlated with complex human diseases [5, 6].
There exist many algorithms that employ different strategies to characterize different classes of CNVs in different size ranges. Here we describe the mrCaNaVaR algorithm [7], which uses read depth information to predict large (>10 Kb) duplications and deletions, along with integer copy numbers for genes. mrCaNaVaR was the first tool developed solely for accurate assessment of segmental duplications in personal genomes, and it was also the only publicly available tool for integer copy number prediction. Although it was initially developed to analyze only human segmental duplications, it has been used to characterize large CNVs in many organisms including cattle [8], great apes [3, 9], Neandertal [10] and Denisovan [11, 12] ancient DNA, and even plant genomes [13].
Here we describe how to use mrCaNaVaR for detecting large CNVs. Since segmental duplication prediction requires tracking multiple map locations of reads, we need to use a multi-mapper [14]. Briefly, after all reads are mapped to a repeat-masked genome (to “clean up” common repeats) using mrFAST or mrsFAST [15, 16], mrCaNaVaR loads the mapping information (i.e., SAM files) and applies a sliding window strategy to count read depth. The read depth distribution is often affected by the G+C content of genomic segments; therefore, mrCaNaVaR applies a statistical smoothing method (LOESS) to correct for bias in high and low G+C regions. It then identifies regions whose read depth is more than 3 standard deviations above or below the genome-wide average, and chains consecutive such regions. mrCaNaVaR also predicts integer copy numbers of 1 Kb windows as the ratio of observed read depth to average read depth, and uses this information to refine the CNV predictions and the estimated breakpoints.
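The calling, chaining, and copy number steps described above can be illustrated with a minimal Python sketch. It is not the mrCaNaVaR implementation: the function names are hypothetical, the windows are assumed to be non-overlapping and already G+C corrected, the mean and standard deviation are assumed to be supplied by the caller, and the conversion of the depth ratio to an integer copy number assumes a diploid genome (ratio multiplied by 2 and rounded).

```python
def call_windows(depths, mu, sd, n_sd=3.0):
    """Label each window 'dup', 'del', or None.

    A window is called when its read depth lies more than n_sd standard
    deviations above or below the average depth mu (assumed inputs).
    """
    labels = []
    for depth in depths:
        if depth > mu + n_sd * sd:
            labels.append("dup")
        elif depth < mu - n_sd * sd:
            labels.append("del")
        else:
            labels.append(None)
    return labels


def chain_calls(labels, window_size=1000):
    """Merge consecutive windows with the same call into (start, end, label)."""
    regions = []
    start, current = None, None
    # A trailing None closes a region that runs to the last window.
    for i, label in enumerate(list(labels) + [None]):
        if label != current:
            if current is not None:
                regions.append((start, i * window_size, current))
            start, current = i * window_size, label
    return regions


def copy_number(depth, mu, ploidy=2):
    """Integer copy number from the depth ratio (diploid scaling is an assumption)."""
    return round(ploidy * depth / mu)


if __name__ == "__main__":
    # Made-up depths for eight 1 Kb windows, with average 100 and SD 10.
    depths = [100, 95, 140, 150, 105, 60, 20, 98]
    labels = call_windows(depths, mu=100, sd=10)
    print(chain_calls(labels))
    print([copy_number(d, mu=100) for d in depths])
```

In practice the average and standard deviation should come from regions believed to be copy number neutral, since estimating them from windows that include large CNVs inflates both values.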
2 Materials
This section contains a list of prerequisite software tools that
mrCaNaVaR requires. All tools described below are open source
and free to use under the GNU public license.
2.1 Installing Zlib
1. Download the zlib program from https://sourceforge.net/projects/libpng/files/zlib/. An example of a Linux command to use is as follows (see Notes 1, 2, 3, and 4):
$ wget https://sourceforge.net/projects/libpng/files/zlib/X.X.X/zlib-X.X.X.tar.gz
where X.X.X denotes version number.
2. Unpack the downloaded file using the command shown below:
$ tar -xzvf zlib-X.X.X.tar.gz
where X.X.X denotes version number.
There should be a new folder named zlib-X.X.X in the current working directory.
3. Remove the zipped version of the file:
$ rm zlib-X.X.X.tar.gz
4. Change directory into zlib-X.X.X:
$ cd zlib-X.X.X
5. Type at the shell prompt:
$ ./configure
$ make test
If everything goes well, then type:
$ sudo make install
Depending on the version of the GCC compiler and some additional library requirements, the installation should be completed without any errors (see Note 5).
2.2 Installing Blast
1. Download the Blast program from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/. An example of a Linux command to use is as follows:
$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-X.X.X+-x64-linux.tar.gz
where X.X.X denotes version number.
2. Unpack the downloaded file using the command shown below:
$ tar -xzvf ncbi-blast-X.X.X+-x64-linux.tar.gz
where X.X.X denotes version number.
There should be a new folder named ncbi-blast-X.X.X+ in the current working directory.
3. Remove the zipped version of the file:
$ rm ncbi-blast-X.X.X+-x64-linux.tar.gz
4. Change directory into ncbi-blast-X.X.X+/bin:
$ cd ncbi-blast-X.X.X+/bin
5. Download ncbi-rmblastn from http://ftp.ncbi.nlm.nih.gov/blast/executables/rmblast/:
$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/rmblast/X.X.X/ncbi-rmblastn-X.X.X-x64-linux.tar.gz
where X.X.X denotes version number.
6. Unpack the downloaded file using the command shown below:
$ tar -xzvf ncbi-rmblastn-X.X.X-x64-linux.tar.gz
where X.X.X denotes version number.
There should be a new folder named ncbi-rmblastn-X.X.X in the current working directory.
7. Copy the rmblastn file into the /path/ncbi-blast-X.X.X+/bin directory:
$ cp /path/ncbi-rmblastn-X.X.X/rmblastn /path/ncbi-blast-X.X.X+/bin
8. Remove the unused files and folder from the ncbi-blast-X.X.X+/bin directory (the -r option is needed because the unpacked folder is a directory):
$ rm -r ncbi-rmblastn-X.X.X*
2.3 Installing Tandem Repeats Finder
1. Change directory into RepeatMasker:
$ cd /path/RepeatMasker
2. Download the “trf” file from http://tandem.bu.edu/trf/trf409.linux64.download.html.
3. Make the binary file executable:
$ chmod 755 trf409.linux64
4. Change the file name:
$ mv trf409.linux64 trf
2.4 Installing RepeatMasker
1. Download the RepeatMasker program from http://www.repeatmasker.org. An example of a Linux command to use is as follows:
$ wget http://www.repeatmasker.org/RepeatMasker-open-X-X-X.tar.gz
where X-X-X denotes version number.
2. Unpack the downloaded file using the command shown below:
$ tar -xzvf RepeatMasker-open-X-X-X.tar.gz
There should be a new folder named RepeatMasker in the current working directory.
3. Remove the zipped version of the file:
$ rm RepeatMasker-open-X-X-X.tar.gz
4. Change directory into RepeatMasker:
$ cd RepeatMasker
5. RepeatMasker provides two open databases, Dfam and Dfam_consensus, and will work with these datasets, but it is advised to obtain a license for the RepBase RepeatMasker Edition to supplement these sequences. To obtain a license and download the library go to http://www.girinst.org. Copy RepBase into the RepeatMasker directory:
$ cp RepBaseRepeatMaskerEdition-XXXXXXXX.tar.gz /path/RepeatMasker/
6. Change directory into RepeatMasker:
$ cd /path/RepeatMasker
7. Unpack the downloaded file using the command shown below:
$ tar -xzvf RepBaseRepeatMaskerEdition-XXXXXXXX.tar.gz
There should be a new folder named Libraries in the current working directory.
8. Remove the zipped version of the file:
$ rm RepBaseRepeatMaskerEdition-XXXXXXXX.tar.gz
9. Check for Dfam updates (optional).
(a) Change directory into Libraries:
$ cd Libraries
(b) Download the Dfam.hmm.gz file from http://www.dfam.org.
$ wget http://www.dfam.org/web_download/Current_Release/Dfam.hmm.gz
(c) Unpack the downloaded file using the command shown below:
$ gunzip Dfam.hmm.gz
There should be a new file named Dfam.hmm in the current working directory.
10. Change directory into RepeatMasker:
$ cd /path/RepeatMasker/
11. Type at the shell prompt (see Note 6):
$ perl ./configure
12. Follow the instructions.
2.5 Installing mrFAST
1. Download the mrFAST program from https://github.com/BilkentCompGen/mrfast/. An example of a Linux command to use is as follows (see Notes 1, 2, 3, and 4):
$ wget -O mrfast-X.X.X.X.tar.gz https://github.com/BilkentCompGen/mrfast/archive/vX.X.X.X.tar.gz
where X.X.X.X denotes version number.
2. Unpack the downloaded file using the command shown below:
$ tar -xzvf mrfast-X.X.X.X.tar.gz
where X.X.X.X denotes version number. There should be a new folder named mrfast-X.X.X.X in the current working directory.
3. Remove the downloaded compressed version of the file:
$ rm mrfast-X.X.X.X.tar.gz
where X.X.X.X denotes version number.
4. Change directory into mrfast-X.X.X.X:
$ cd mrfast-X.X.X.X
References
Nat Rev Genet 12:363–376
2. Ventura M, Catacchio CR, Alkan C et al (2011) Gorilla genome structural variation reveals evolutionary parallelisms with chimpanzee.
Smit AFA, Hubley R, Green P (1996–2004) RepeatMasker Open-3.0
18. Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences.