(Open Access) Whole-Genome Shotgun Sequence CNV Detection Using Read Depth. (2018) | Fatma Kahveci

Derek M. Bickhart (ed.), Copy Number Variants: Methods and Protocols, Methods in Molecular Biology, vol. 1833,

https://doi.org/10.1007/978-1-4939-8666-8_4, © Springer Science+Business Media, LLC, part of Springer Nature 2018

Chapter 4

Whole-Genome Shotgun Sequence CNV Detection Using Read Depth

Fatma Kahveci and Can Alkan

Abstract

With the developments in high-throughput sequencing (HTS) technologies, researchers have gained a powerful tool to identify structural variants (SVs) in genomes at substantially lower cost than before. SVs can be broadly classified into two main categories: balanced rearrangements and copy number variations (CNVs). Many algorithms have been developed to characterize CNVs using HTS data, each focusing on different types and size ranges of variants and using different read signatures. Read depth (RD) based tools are more common in characterizing large (>10 kb) CNVs since the RD strategy does not rely on fragment size and read length, which are limiting factors in read pair and split read analysis. Here we provide a guideline for a user-friendly tool for detecting large segmental duplications and deletions that can also predict integer copy numbers for duplicated genes.

Key words Copy number variation, Whole genome shotgun sequencing, Read depth, mrFAST, mrsFAST

Abbreviations

CNV         Copy number variation
mrCaNaVaR   Micro read Copy Number Variant Regions
mrFAST      Micro read Fast Alignment Search Tool
mrsFAST     Micro read substitution only Fast Alignment Search Tool
RD          Read depth
TRF         Tandem repeats finder
WGS         Whole genome sequencing

1 Introduction

CNVs are changes in the amount of DNA in a genome that manifest as duplication (gain) and deletion (loss) events [1]. Traditionally, a copy number variant is defined as a gain or loss of DNA in the genome that is larger than 1 Kb, although smaller deletions or duplications (>50 bp) are considered CNVs in several projects such as the 1000 Genomes Project. CNVs are known to have played a major role in evolution [2–4], and several of them are correlated with complex human disease [5, 6].

Many algorithms exist that employ different strategies to characterize different classes of CNVs in different size ranges. Here we describe the mrCaNaVaR algorithm [7], which uses read depth information to predict large (>10 Kb) duplications and deletions, along with integer copy numbers for genes. mrCaNaVaR was the first tool developed solely for accurate assessment of segmental duplications in personal genomes, and it was also the only publicly available tool for integer copy number prediction. Although it was initially developed to analyze only human segmental duplications, it has been used to characterize large CNVs in many organisms including cattle [8], great apes [3, 9], Neandertal [10] and Denisovan [11, 12] ancient DNA, and even plant genomes [13].

Here we describe how to use mrCaNaVaR for detecting large CNVs. Since segmental duplication prediction requires tracking multiple map locations of reads, we need to use a multi-mapper [14]. Briefly, after mapping all reads to a repeat-masked genome (to “clean up” common repeats) using mrFAST or mrsFAST [15, 16], mrCaNaVaR loads the mapping information (i.e., SAM files) and applies a sliding window strategy to count read depth. The read depth distribution is often affected by the G+C content of genomic segments; therefore mrCaNaVaR applies a statistical smoothing method (LOESS) to correct for bias in high and low G+C regions. It then identifies regions whose read depth is more than 3 standard deviations above or below the genome-wide average, and chains consecutive such regions. mrCaNaVaR also predicts integer copy numbers of windows of size 1 Kb as the ratio of observed read depth to average read depth, and uses this information to refine CNV predictions and the estimated breakpoints.
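The read depth strategy outlined above can be illustrated with a short Python sketch. It is not mrCaNaVaR's implementation. All function names are hypothetical, a simple rescaling of per-G+C-bin means stands in for the LOESS smoothing, and copy number is taken as twice the depth ratio under the assumption that the average depth corresponds to a diploid (two-copy) state.

```python
from statistics import mean, pstdev


def gc_correct(depths, gc_fractions, n_bins=20):
    """Rescale window depths so every G+C bin has the overall mean depth.

    Assumption: a crude stand-in for LOESS smoothing, for illustration only.
    """
    overall = mean(depths)

    def bin_of(gc):
        return min(int(gc * n_bins), n_bins - 1)

    by_bin = {}
    for depth, gc in zip(depths, gc_fractions):
        by_bin.setdefault(bin_of(gc), []).append(depth)
    bin_mean = {b: mean(values) for b, values in by_bin.items()}

    corrected = []
    for depth, gc in zip(depths, gc_fractions):
        m = bin_mean[bin_of(gc)]
        corrected.append(depth * overall / m if m > 0 else depth)
    return corrected


def call_windows(depths, n_sd=3.0):
    """Label each window 'dup', 'del', or None using mean +/- n_sd * SD."""
    mu, sd = mean(depths), pstdev(depths)
    labels = []
    for depth in depths:
        if depth > mu + n_sd * sd:
            labels.append("dup")
        elif depth < mu - n_sd * sd:
            labels.append("del")
        else:
            labels.append(None)
    return labels


def chain(labels):
    """Merge consecutive windows with the same call into (start, end, label).

    Indices are window numbers; end is exclusive.
    """
    calls, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] is not None:
                calls.append((start, i, labels[start]))
            start = i
    return calls


def copy_number(depth, average_depth, ploidy=2):
    """Integer copy number from the depth ratio (assumes average = ploidy copies)."""
    return round(ploidy * depth / average_depth)


if __name__ == "__main__":
    # Toy data: 200 windows at depth 30 with one gain and one loss.
    depths = [30.0] * 200
    depths[50:53] = [90.0] * 3
    depths[120:122] = [0.0] * 2
    for start, end, kind in chain(call_windows(depths)):
        print(start, end, kind, copy_number(mean(depths[start:end]), 30.0))
```

In this toy example the mean and standard deviation are computed over all windows, including the outliers themselves, so the thresholds only work when most windows are at the normal depth.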

2 Materials

This section contains a list of prerequisite software tools that mrCaNaVaR requires. All tools described below are open source and free to use under the GNU public license.

2.1 Installing Zlib

1. Download the zlib program from https://sourceforge.net/projects/libpng/files/zlib/. An example of a Linux command to use is as follows (see Notes 1, 2, 3, and 4):

$ wget https://sourceforge.net/projects/libpng/files/zlib/X.X.X/zlib-X.X.X.tar.gz

where X.X.X denotes version number.

2. Unpack the downloaded file using the command shown below:

$ tar -xzvf zlib-X.X.X.tar.gz

where X.X.X denotes version number. There should be a new folder named zlib-X.X.X in the current working directory.

3. Remove the compressed archive:

$ rm zlib-X.X.X.tar.gz

4. Change directory into zlib-X.X.X:

$ cd zlib-X.X.X

5. Type at the shell prompt:

$ ./configure

$ make test

If everything goes well, then type:

$ sudo make install

Provided that the GCC compiler version and any additional library requirements are satisfied, the installation should complete without errors (see Note 5).

2.2 Installing Blast

1. Download the Blast program from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/. An example of a Linux command to use is as follows:

$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-X.X.X+-x64-linux.tar.gz

where X.X.X denotes version number.

2. Unpack the downloaded file using the command shown below:

$ tar -xzvf ncbi-blast-X.X.X+-x64-linux.tar.gz

where X.X.X denotes version number. There should be a new folder named ncbi-blast-X.X.X+ in the current working directory.

3. Remove the compressed archive:

$ rm ncbi-blast-X.X.X+-x64-linux.tar.gz

4. Change directory into ncbi-blast-X.X.X+/bin:

$ cd ncbi-blast-X.X.X+/bin

5. Download ncbi-rmblastn from http://ftp.ncbi.nlm.nih.gov/blast/executables/rmblast/.

$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/rmblast/X.X.X/ncbi-rmblastn-X.X.X-x64-linux.tar.gz

where X.X.X denotes version number.

6. Unpack the downloaded file using the command shown below:

$ tar -xzvf ncbi-rmblastn-X.X.X-x64-linux.tar.gz

where X.X.X denotes version number. There should be a new folder named ncbi-rmblastn-X.X.X in the current working directory.

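7. Copy the rmblastn file into the /path/ncbi-blast-X.X.X+/bin directory:

$ cp /path/ncbi-rmblastn-X.X.X/rmblastn /path/ncbi-blast-X.X.X+/bin

8. Remove the unused archive and unpacked folder from the ncbi-blast-X.X.X+/bin directory (the -r option is needed because the pattern also matches the unpacked folder):

$ rm -r ncbi-rmblastn-X.X.X*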

2.3 Installing Tandem Repeats Finder

1. Change directory into RepeatMasker:

$ cd /path/RepeatMasker

2. Download the “trf” file from http://tandem.bu.edu/trf/trf409.linux64.download.html.

3. Make the binary file executable:

$ chmod 755 trf409.linux64

4. Rename the file:

$ mv trf409.linux64 trf

2.4 Installing RepeatMasker

1. Download the RepeatMasker program from http://www.repeatmasker.org. An example of a Linux command to use is as follows:

$ wget http://www.repeatmasker.org/RepeatMasker-open-X-X-X.tar.gz

where X-X-X denotes version number.

2. Unpack the downloaded file using the command shown below:

$ tar -xzvf RepeatMasker-open-X-X-X.tar.gz

There should be a new folder named RepeatMasker in the current working directory.

3. Remove the compressed archive:

$ rm RepeatMasker-open-X-X-X.tar.gz

4. Change directory into RepeatMasker:

$ cd RepeatMasker

5. RepeatMasker provides two open databases, Dfam and Dfam_consensus, and will work with these datasets, but it is advised to obtain a license for the RepBase RepeatMasker Edition to supplement these sequences. To obtain a license and download the library go to http://www.girinst.org. Copy RepBase into the RepeatMasker directory:

$ cp RepBaseRepeatMaskerEdition-XXXXXXXX.tar.gz /path/RepeatMasker/

6. Change directory into RepeatMasker:

$ cd /path/RepeatMasker

7. Unpack the RepBase library using the command shown below:

$ tar -xzvf RepBaseRepeatMaskerEdition-XXXXXXXX.tar.gz

There should be a new folder named Libraries in the current working directory.

8. Remove the compressed archive:

$ rm RepBaseRepeatMaskerEdition-XXXXXXXX.tar.gz

9. Check for Dfam updates (optional).

(a) Change directory into Libraries:

$ cd Libraries

(b) Download the Dfam.hmm.gz file from http://www.dfam.org.

$ wget http://www.dfam.org/web_download/Current_Release/Dfam.hmm.gz

(c) Unpack the downloaded file using the command shown below:

$ gunzip Dfam.hmm.gz

There should be a new file named Dfam.hmm in the current working directory.

10. Change directory into RepeatMasker:

$ cd /path/RepeatMasker/

11. Type at the shell prompt (see Note 6):

$ perl ./configure

12. Follow the instructions.
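Before continuing, it can help to confirm that the executables installed so far can actually be found. The following Python sketch is one possible check, not part of the original protocol. The helper name find_tools is hypothetical, the directories are the same /path/ placeholders used above, and the executable names rmblastn, trf, and RepeatMasker are assumed to match the files installed in the previous steps.

```python
import os
import shutil


def find_tools(tools, extra_dirs=()):
    """Return {tool: full path or None}, searching extra_dirs before PATH."""
    path_dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    search_path = os.pathsep.join(list(extra_dirs) + path_dirs)
    return {tool: shutil.which(tool, path=search_path) for tool in tools}


if __name__ == "__main__":
    # Placeholder directories; replace with the real installation paths.
    install_dirs = ["/path/ncbi-blast-X.X.X+/bin", "/path/RepeatMasker"]
    found = find_tools(["rmblastn", "trf", "RepeatMasker"], install_dirs)
    for tool, location in found.items():
        print(f"{tool}: {location or 'NOT FOUND'}")
```

A tool reported as NOT FOUND is either missing, not marked executable, or installed in a directory that was not listed.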

2.5 Installing mrFAST

1. Download the mrFAST program from https://github.com/BilkentCompGen/mrfast/. An example of a Linux command to use is as follows (see Notes 1, 2, 3, and 4):

$ wget -O mrfast-X.X.X.X.tar.gz https://github.com/BilkentCompGen/mrfast/archive/vX.X.X.X.tar.gz

where X.X.X.X denotes version number.

2. Unpack the downloaded file using the command shown below:

$ tar -xzvf mrfast-X.X.X.X.tar.gz

where X.X.X.X denotes version number. There should be a new folder named mrfast-X.X.X.X in the current working directory.

3. Remove the downloaded compressed version of the file:

$ rm mrfast-X.X.X.X.tar.gz

where X.X.X.X denotes version number.

4. Change directory into mrfast-X.X.X.X:

References

Tandem repeats finder: a program to analyze DNA sequences

A Draft Sequence of the Neandertal Genome

A high-coverage genome sequence from an archaic Denisovan individual

Genetic history of an archaic hominin group from Denisova Cave in Siberia

Genome structural variation discovery and genotyping
