MinorAlleleCatcher

MinorAlleleCatcher filters out reads with attributes that could contribute to false base calls. The reads that pass the filters are then used to estimate the frequency of particular SNPs in a pooled sequencing library composed of multiple independent samples.

Taylor, S. M. et al. Absence of putative Plasmodium falciparum artemisinin resistance mutations in sub-Saharan Africa: A molecular epidemiologic study. J. Infect. Dis. 211:680-8 (2015).

Introduction

The aim of this program is to take a sequencing library of pooled DNA isolates and estimate the frequency of SNPs in the population on a per-locus basis. This is accomplished by filtering out the many factors that can contribute to false positive SNP calls in the data set. For example, raising the minimum read quality required for a base to be considered at a position drastically reduces the number of spurious low-frequency SNP calls.
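The effect of the minimum read quality on a single locus can be illustrated with a minimal sketch. This is not the program's own code: the function name `minor_allele_frequency` and its inputs (a list of base calls and matching Phred qualities for one position) are assumptions made for illustration.

```python
from collections import Counter


def minor_allele_frequency(bases, quals, min_qual=10):
    """Return (major, minor, minor_freq, depth) for the base calls at one locus.

    bases: one base call per read covering the locus (hypothetical input)
    quals: matching Phred base qualities
    """
    # Keep only base calls that meet the minimum quality.
    kept = [b for b, q in zip(bases, quals) if q >= min_qual]
    depth = len(kept)
    if depth == 0:
        return None, None, 0.0, 0
    # Rank alleles by how often they were observed among the kept calls.
    ranked = Counter(kept).most_common()
    major = ranked[0][0]
    if len(ranked) < 2:
        # Only one allele survived filtering, so there is no minor allele.
        return major, None, 0.0, depth
    minor, minor_count = ranked[1]
    return major, minor, minor_count / float(depth), depth


# Made-up example: eight confident A calls and two low-quality mismatches.
bases = list("AAAAAAAAGT")
quals = [40] * 8 + [12, 5]
lenient = minor_allele_frequency(bases, quals, min_qual=10)
strict = minor_allele_frequency(bases, quals, min_qual=34)
```

With the lenient threshold the low-quality G is reported as a minor allele; with the stricter threshold it is dropped and no minor allele is called at this position.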

Read quality is not the only metric used. In total, map quality, read quality, alignment score (calculated by bowtie2), distance from the read start/end, and read size of Ion Torrent sequencing reads were used as minimum criteria for a base to be considered for SNP calling. This resulted in the following number of bases filtered for failing minimum criteria:


When we compare the filtered calls to the known frequency of a particular SNP, we accurately recover that frequency.


Given the above reduction in false positives and the accurate frequency estimate, this program can accurately, sensitively, and specifically identify SNP frequencies in a pooled high-throughput sequencing library.

Usage


Requirements:
Note: version numbers for dependencies are the ones on which the program was built. Older or newer versions may work.
numpy 1.8.2
scipy 0.13.3
pysam 0.7.4 (Samtools 0.1.18)

Usage:
Usage: minor_allele_catcher [options] <ref.fasta> <in.sorted.bam>

If no default is given, the default is either False or 0.
Options: -q/--qual           minimum read qual [default: 10]
         -m/--mapq           minimum map qual [default: 10]
         -a/--aln_score      minimum alignment score [default: None]
         -e/--ends           minimum distance from read ends
         -d/--depth          max depth for pysam.pileup [default: 100000]
         -r/--rmdup          remove duplicates
         -n/--nearq          minimum distance near snp of passing quality
         -s/--size           minimum read size
         -b/--bothstrands    filter if strand bias [default: False]
         -c/--cumulative     count filter cumulatively [default: False]
         -i/--indel          skip indel reads [default: False]
         --refrelative       compare snps to ref rather than major allele [default: False]
         -p/--primersize     size of primer
         -l/--lower          minimum alignment range
         -u/--upper          maximum alignment range
         -o/--out            output filename
         -h/--help           help screen

Note: alignment score is a scoring calculation produced by bowtie2.
In Taylor et al. (2015), the following options were used:

python minor_allele_catcher.py -q 34 -m 10 -e 10 -s 200 -a 81 -b input.sorted.bam
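For orientation on the alignment score noted above: bowtie2 records it in the `AS:i` optional field of each alignment. The sketch below reads that field from a single SAM text line using only the standard library. The helper name `alignment_score` and the example record are made up for illustration, and this is not necessarily how the program itself reads the score.

```python
def alignment_score(sam_line):
    """Return the AS:i value from one SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        if field.startswith("AS:i:"):
            return int(field[len("AS:i:"):])
    return None


# Made-up SAM record, used only to exercise the helper.
example = "\t".join([
    "read1", "0", "chr1", "101", "42", "5M", "*", "0", "0",
    "ACGTA", "IIIII", "AS:i:95", "XS:i:40",
])
score = alignment_score(example)
passes = score is not None and score >= 81  # threshold from the -a 81 option above
```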

Output

SNP loci are printed to standard out, one per line, with their frequency and filter counts in tab-delimited columns. The per-locus counts of reads removed by each filter are included in the same tab-delimited format. By default, these counts are non-cumulative, and read filters are applied in this order: indel, map quality, read size, alignment score, read quality, neighboring quality, read ends, optical duplicates.
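The ordered filtering and per-filter counting described above can be sketched as follows. This is an illustration under stated assumptions, not the program's implementation: reads are plain dicts with hypothetical keys, the neighboring-quality and duplicate filters are omitted, and "non-cumulative" is interpreted here as counting each rejected read only against the first filter it fails, while "cumulative" counts it against every filter it fails.

```python
from collections import OrderedDict

# Each entry maps an output column name to a predicate that returns True
# when the read FAILS that filter. Order follows the list in the text above.
# The read keys and option keys are hypothetical names for this sketch.
FILTERS = OrderedDict([
    ("indel_f", lambda r, o: o["skip_indel"] and r["has_indel"]),
    ("mapq_f", lambda r, o: r["mapq"] < o["mapq"]),
    ("read_size_f", lambda r, o: r["length"] < o["size"]),
    ("as_f", lambda r, o: o["aln_score"] is not None
        and r["aln_score"] < o["aln_score"]),
    ("readq_f", lambda r, o: r["base_qual"] < o["qual"]),
    # offset is the 0-based position of the base within the read.
    ("read_ends_f", lambda r, o:
        min(r["offset"], r["length"] - 1 - r["offset"]) < o["ends"]),
])


def count_filters(reads, opts, cumulative=False):
    """Return (passing_reads, per_filter_counts) for the reads at one locus."""
    counts = OrderedDict((name, 0) for name in FILTERS)
    passed = []
    for read in reads:
        failed = [name for name, fails in FILTERS.items() if fails(read, opts)]
        if not failed:
            passed.append(read)
        elif cumulative:
            # Assumed meaning: count the read against every filter it fails.
            for name in failed:
                counts[name] += 1
        else:
            # Assumed meaning: count the read only against the first failure.
            counts[failed[0]] += 1
    return passed, counts


# Thresholds taken from the example command in the Usage section.
opts = {"qual": 34, "mapq": 10, "ends": 10, "size": 200,
        "aln_score": 81, "skip_indel": False}
# Made-up reads: one passes, one fails mapq and size, one fails quality and ends.
reads = [
    {"mapq": 40, "length": 250, "aln_score": 120, "base_qual": 36,
     "offset": 100, "has_indel": False},
    {"mapq": 5, "length": 150, "aln_score": 120, "base_qual": 36,
     "offset": 50, "has_indel": False},
    {"mapq": 40, "length": 250, "aln_score": 120, "base_qual": 20,
     "offset": 3, "has_indel": False},
]
passed, first_fail = count_filters(reads, opts)
_, every_fail = count_filters(reads, opts, cumulative=True)
```

Under this interpretation, the per-filter counts in non-cumulative mode sum to the number of rejected reads, whereas in cumulative mode a single read can contribute to several columns.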

Column legend:
chrom -- chromosome
pos -- 0-based reference position
ref_allele -- reference allele
major_allele -- major allele (can be different from reference allele)
minor_allele -- minor allele
filtered_depth -- depth of reads at base position after filtration
minor_allele_freq -- frequency of minor allele
major_f -- number of reads with major allele on forward strand
major_r -- number of reads with major allele on reverse strand
minor_f -- number of reads with minor allele on forward strand
minor_r -- number of reads with minor allele on reverse strand
unfiltered_depth -- depth of reads before filtration of reads
mapq_f -- number of reads filtered by map quality
read_size_f -- number of reads filtered by read size
readq_f -- number of reads filtered by read quality
neighbor_f -- number of reads filtered by neighboring loci read quality
read_ends_f -- number of reads filtered for being at read ends
indel_f -- number of reads filtered for having indels (insertions/deletions)
bias_f -- number of reads filtered for having a strand bias
as_f -- number of reads filtered for alignment score
dup_f -- number of reads filtered for being optical duplicates
total_median_depth -- deprecated
minor_median_depth -- deprecated
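
A minimal sketch for loading one output line into named fields is shown below. It assumes that the columns are printed in the same order as the legend above and that count columns are integers; neither assumption is confirmed here, so check them against real output. The helper name `parse_row` and the example values are made up.

```python
# Column names copied from the legend; the ORDER is an assumption.
COLUMNS = [
    "chrom", "pos", "ref_allele", "major_allele", "minor_allele",
    "filtered_depth", "minor_allele_freq",
    "major_f", "major_r", "minor_f", "minor_r", "unfiltered_depth",
    "mapq_f", "read_size_f", "readq_f", "neighbor_f", "read_ends_f",
    "indel_f", "bias_f", "as_f", "dup_f",
    "total_median_depth", "minor_median_depth",
]
TEXT_COLUMNS = {"chrom", "ref_allele", "major_allele", "minor_allele",
                "total_median_depth", "minor_median_depth"}


def parse_row(line):
    """Turn one tab-delimited output line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    row = dict(zip(COLUMNS, values))
    for name, value in row.items():
        if name in TEXT_COLUMNS:
            continue  # alleles, chromosome, and deprecated columns stay as text
        row[name] = float(value) if name == "minor_allele_freq" else int(value)
    return row


# Made-up line, used only to exercise the parser.
example = "\t".join([
    "chr1", "1724", "A", "A", "G", "1000", "0.05",
    "480", "470", "26", "24", "1500",
    "100", "50", "200", "0", "100", "0", "0", "50", "0", "0", "0",
])
row = parse_row(example)
minor_reads = row["minor_f"] + row["minor_r"]
```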