CnvDetectD2




Introduction

Copy number variation is an important means of genetic adaptation. It can effect change rapidly, typically in gene dosage but also through diversification of gene function. While copy number variants can be identified with qPCR, that approach does not scale to genome-wide de novo identification. High-throughput sequencing provides a cost-effective and adaptable means of identifying genetic variation genome-wide, including SNPs, structural variants, and copy number variants. Read depth of alignments to a reference genome is a common means of identifying copy number variants from whole-genome shotgun sequencing, and it is the approach our program takes.
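As a rough illustration of the read-depth principle (a simplified sketch, not the program's implementation; all names and thresholds here are illustrative), coverage can be tallied in fixed windows and windows whose depth deviates strongly from the genome-wide median flagged as candidate copy number variants:

```python
from statistics import median

def candidate_cnv_windows(depth, window=100, min_ratio=1.75):
    """Return (start, end, ratio) for windows whose mean depth, relative to
    the genome-wide median window depth, exceeds min_ratio (possible
    duplication) or falls below 1/min_ratio (possible deletion).
    `depth` is a list of per-base coverage values."""
    win_means = []
    for start in range(0, len(depth) - window + 1, window):
        win = depth[start:start + window]
        win_means.append((start, sum(win) / window))
    genome_median = median(m for _, m in win_means)
    calls = []
    for start, m in win_means:
        ratio = m / genome_median
        if ratio >= min_ratio or ratio <= 1.0 / min_ratio:
            calls.append((start, start + window, ratio))
    return calls

# toy example: 30x baseline with one doubled segment at 300-400
depth = [30] * 300 + [60] * 100 + [30] * 200
print(candidate_cnv_windows(depth))  # -> [(300, 400, 2.0)]
```

Real data is far noisier than this toy example, which is exactly why the GC normalization and discordant-read verification described below are needed.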

High-throughput sequencing libraries can display a strong bias in sequencing depth related to the GC content of the read product, and the extent of this bias varies between sequencing libraries, even those prepared from the same isolate.
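A per-library correction can be sketched as follows: bin reads by GC fraction and weight each bin by the ratio of its expected frequency (from the reference) to its observed frequency (from the library). The function names and binning scheme below are illustrative, not the program's actual code:

```python
from collections import Counter

def gc_fraction(seq):
    """Fraction of G/C bases in a sequence."""
    return sum(base in "GCgc" for base in seq) / len(seq)

def correction_factors(read_seqs, expected_frac, bins=10):
    """Per-GC-bin weights: expected_frac[g] / observed_frac[g].
    expected_frac maps GC bin -> fraction of possible read products in the
    reference with that GC content; read_seqs are the library's reads."""
    observed = Counter(min(int(gc_fraction(s) * bins), bins - 1)
                       for s in read_seqs)
    total = sum(observed.values())
    weights = {}
    for g, exp in expected_frac.items():
        obs = observed.get(g, 0) / total
        weights[g] = exp / obs if obs > 0 else 0.0
    return weights

# toy example: GC-rich reads under-represented relative to the reference,
# so their bin receives a weight > 1 to boost their contribution to depth
weights = correction_factors(["ATAT", "ATAT", "GCGC", "ATAT"],
                             expected_frac={0: 0.5, 1: 0.5}, bins=2)
print(weights)
```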

Method Overview



Our program normalizes read depth on a per-read basis for each sequencing library. The expected GC content of reads is first calculated from the reference genome; the coordinate-sorted BAM file is then analyzed to derive correction factors. The correction factors are applied to each read, and a pileup file of the normalized read depth is ultimately produced.
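Conceptually, applying the correction means each read contributes its GC-bin weight, rather than 1, to every reference position it covers. A minimal sketch of that idea (illustrative names and simplified alignment model, not the program's code):

```python
def normalized_pileup(reads, weights, ref_len, bins=10):
    """Build a GC-normalized depth profile.
    reads:   (start, seq) tuples aligned to a reference of length ref_len
    weights: GC bin -> per-library correction factor"""
    pile = [0.0] * ref_len
    for start, seq in reads:
        gc = sum(b in "GCgc" for b in seq) / len(seq)
        w = weights[min(int(gc * bins), bins - 1)]
        # the read adds its weight, not 1, at each position it covers
        for pos in range(start, min(start + len(seq), ref_len)):
            pile[pos] += w
    return pile

# toy example: an AT-only read (weight 1.0) and a GC-only read (weight 2.0)
print(normalized_pileup([(0, "ATAT"), (2, "GCGC")], {0: 1.0, 9: 2.0}, 6))
```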


In addition, relying solely on read depth, even when normalized for GC bias, results in a significant false positive rate. The following figure compares copy number variants identified from read depth for three separate lab culture isolates. Although the extent of true copy number variation in the genomes of these three isolates is known, the read-depth metric still identifies more copy number variants than actually exist.



Therefore, to accurately identify copy number variants from read depth, we must both normalize the GC bias on a per-library basis and reduce the number of false positive read-depth calls with an independent metric. We use discordant read pairs to verify candidate copy number variants. Everted reads can be evidence of traversal of a tandem duplication breakpoint, so their alignment with the breakpoints of a read-depth call corroborates the duplication. Spanning reads span a deleted region; against the reference genome they appear as abnormally large read products, and their alignment with the breakpoints of a read-depth call corroborates the deletion. The following figure outlines the whole process:






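A simplified model of how discordant pairs are classified (illustrative only; the program operates on BAM records, and the strand/insert conventions here are assumptions):

```python
def classify_pair(left_strand, right_strand, insert_size, max_insert):
    """Classify a read pair by orientation and apparent insert size.
    left/right refer to reference order; strands are '+' or '-';
    max_insert is an upper bound on the expected library insert size."""
    if (left_strand, right_strand) == ("-", "+"):
        # reverse-forward (everted) orientation: tandem duplication signal
        return "everted"
    if (left_strand, right_strand) == ("+", "-"):
        if insert_size > max_insert:
            # pair spans a deleted region: deletion signal
            return "spanning"
        return "concordant"
    # same-strand pairs suggest other rearrangements (e.g. inversions)
    return "other"

print(classify_pair("-", "+", 350, 600))  # -> everted
print(classify_pair("+", "-", 900, 600))  # -> spanning
```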
Download and Installation

The program can be downloaded as a gzipped tar file from the baileylab webserver.

The latest version is 1.2:  baileylab.umassmed.edu/Software/CnvDetectD2_v1_2.tar.gz

Usage

The following dependency versions are known to work; newer or older versions may also work.
Requirements:
python 2.7
numpy 1.7.1
scipy 0.13.3
pysam 0.7.4 (samtools version 0.1.18)
Biopython 1.61

The bias correction process is a multi-step command-line process. All commands are contained within the single program CnvDetectD2.py:

python CnvDetectD2.py

Program:  CnvDetectD2.py
Version: 1.2

#########################################################
Usage:   CnvDetectD2.py <command>

Command: prodgc        creates cfile by product gc
         readgc        creates cfile by read gc
         pileup        converts cfile to pileup (as fasta)
         stats         statistics for wig files
         wig           converts pileup to wig file
         expected      creates expected file
         cfile2bam     appends cfile values directly to bam
         merge         merges pileup files
         segmentation  segments pileup to identify CNVs
         pe            paired end methods
         intersect     pe method intersections
         help          help screen
#########################################################

Usage information for any command or subcommand can likewise be seen by executing it without arguments.

Example Workflow

Below are the sequential commands for a product-side correction:
### 01 produce expected file for the reference genome
python CnvDetectD2.py expected prodgc --prefix your_genome.expected  \
    your_genome.fasta
### 02 create GC correction file based on individual library 
python CnvDetectD2.py prodgc -p seq_lib -f your_genome.fasta \
    -e your_genome.expected seq_lib.bam
### 03 produce a GC-corrected pileup by weighting reads
python CnvDetectD2.py pileup -p seq_lib seq_lib.cfile seq_lib.bam
### 04 generate wig file of read depth (input for mean shift algorithm)
python CnvDetectD2.py wig -p seq_lib seq_lib.pileup
### 05 segment and detect copy number variants with mean shift
python CnvDetectD2.py segmentation meanshift -p seq_lib  \
    -f your_genome.fasta seq_lib.wig
### 06 detect discordant read pairs that are inverted
python CnvDetectD2.py pe inverted -b -n seq_lib.inverted.bed seq_lib.bam
### 07 detect discordant read pairs that span deletions
python CnvDetectD2.py pe spanning -b -n seq_lib.spanning.bed seq_lib.bam
### 08 intersect the CNV calls with the spanning and inverted discordant pairs
python CnvDetectD2.py intersect -s seq_lib.spanning.bed \
    -i seq_lib.inverted.bed seq_lib.bed > seq_lib.intersect.bed