CnvDetectD2
Introduction
Copy number variation is an important means of genetic adaptation. It is a rapid means to effect change, typically in gene dosage but also for diversification of gene function. While identification of copy number variants can be performed with QPCR, it is not an efficient method for genome wide de novo identification. High throughput sequencing provides a cost-effective and adaptable means of genome wide genetic variation identification - including SNPs, structural variations, and copy number variation. Read depth coverage of read alignments to a reference genome is a common means of identifying copy number variants from whole genome shotgun sequencing. Our program utilizes this approach.High throughput sequencing libraries can display a strong bias in sequencing depth related to the GC% content of the read product. The extent of this variation is variable between sequencing libraries of the same isolate.
Method Overview
Our program normalizes read depth on a per read basis for each sequencing library. An expected GC% of reads is calculated from the reference genome, and then the coordinate sorted bam file is analyzed to calculate the correction factors. The correction factors are applied to each read, and a pileup file of the normalized read depth is ultimately produced.
Therefore to accurately identify copy number variants from read depth, we must normalize the GC bias on a per library basis and reduce the number of false positive calls from a read depth based approach by some independent metric. We use discordant reads to verify the potential copy number variants. Everted reads can be evidence of traversal of a tandem duplication breakpoint, and their alignment with the breakpoint of read depth identified copy number variants is verification of the veracity of the copy number variant. Spanning reads are reads that span the deleted region, and in the reference genome are large read products that align with the identified deletion breakpoints by read depth. The following figure outlines the whole process:

Download and Installation
The program can be download as a tar gzip file on the baileylab webserver.The latest version is 1.2: baileylab.umassmed.edu/Software/CnvDetectD2_v1_2.tar.gz
Usage
The following dependencies are guaranteed to work, but newer or older versions are likely to work.
The bias correction process is a multi-step command line process. All commands are contained within the singular program bias_correction.py
Likewise usage information for a command or subcommand can be seen by executing without any arguments. Example Workflow Below are the sequential commands for a product side correction:
Requirement: python 2.7 numpy 1.7.1 scipy 0.13.3 pysam 0.7.4 (samtools version 0.1.18) Biopython 1.61
The bias correction process is a multi-step command line process. All commands are contained within the singular program bias_correction.py
python bias_correction.py Program: CnvDetectD2.pl Version: 1.2 ######################################################### Usage: CnvDetectD2.pl <command> Command: prodgc creates cfile by product gc readgc creates cfile by read gc pileup converts cfile to pileup (as fasta) stats statistics for wig files wig converts pileup to wig file expected creates expected file cfile2bam appends cfile values directly to bam merge merges pileup files segmentation segments pileup to identify CNVs pe paired end methods intersect pe method intersections help help screen #########################################################
Likewise usage information for a command or subcommand can be seen by executing without any arguments. Example Workflow Below are the sequential commands for a product side correction:
### 01 produce expected file for the reference genome python CnvDetectD2.py expected prodgc --prefix your_genome.expected \ your_genome.fasta ### 02 create GC correction file based on individual library python CnvDetectD2.py prodgc -p seq_lib -f your_genome.fasta \ -e your_genome.expected seq_lib.bam ### 03 correct pileup based and weighting reads python CnvDetectD2.py pileup -p seq_lib seq_lib.cfile seq_lib.bam ### 04 generate wig file of read depth (input for mean shift algorithm) python CnvDetectD2.py wig -p seq_lib seq_lib.pileup ### 05 segment and detect copy number variants with mean shift python CnvDetectD2.py segmentation meanshift -p seq_lib \ -f your_genome.fasta seq_lib.wig ###06 detect discordant read pairs that are inverted python CnvDetectD2.py pe inverted -b -n seq_lib.inverted.bed seq_lib.bam ###07 detect discordant read pairs that span deletions python CnvDetectD2.py pe spanning -b -n seq_lib.spanning.bed seq_lib.bam ##08 merge the spanning and inverted discordant pairs python CnvDetectD2.py intersect -s seq_lib.spanning.bed \ -i seq_lib.inverted.bed seq_lib.bed > seq_lib.intersect.bed