(This has been x-posted to the GATK forum.)
I have a file called rawSnps.vcf, which I've been trying to filter through the VariantFiltration Walker. The command I've been using is as follows:
/usr/java/latest/bin/java \
-Xmx2g \
-jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar \
-T VariantFiltration \
-R /home/rothlab/agopal/bin/bowtie2-2.2.3/indexes/s_cerevisiae.fa \
-V Papi_C02_1/rawSnps.vcf \
--filterExpression "QD < 2 || FS > 60 || MQ < 40 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8" \
--filterName "snpfilter" \
-o Papi_C02_1/filtered_snps.vcf -allowPotentiallyMisencodedQuals
I noticed that this wasn't working, so I switched to a more simple command, like this:
/usr/java/latest/bin/java \
-Xmx2g \
-jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar \
-T VariantFiltration \
-R /home/rothlab/agopal/bin/bowtie2-2.2.3/indexes/s_cerevisiae.fa \
-V rawSnps.vcf \
--filterExpression "MQ < 30" \
--filterName "snpfilter" \
-o filtered_snps_test.vcf \
-allowPotentiallyMisencodedQuals
However, that isn't working either.
The output I get from GATK looks like this:
/usr/java/latest/bin/java -Xmx2g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T VariantFiltration -R /home/rothlab/agopal/bin/bowtie2-2.2.3/indexes/s_cerevisiae.fa -V rawSnps.vcf --filterExpression "MQ < 30" --filterName "snpfilter" -o filtered_snps_test.vcf -allowPotentiallyMisencodedQuals
INFO 10:03:14,434 HelpFormatter - --------------------------------------------------------------------------------
INFO 10:03:14,436 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO 10:03:14,437 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 10:03:14,437 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 10:03:14,442 HelpFormatter - Program Args: -T VariantFiltration -R /home/rothlab/agopal/bin/bowtie2-2.2.3/indexes/s_cerevisiae.fa -V rawSnps.vcf --filterExpression MQ < 30 --filterName snpfilter -o filtered_snps_test.vcf -allowPotentiallyMisencodedQuals
INFO 10:03:14,445 HelpFormatter - Executing as agopal@guru-int.mshri.on.ca on Linux 2.6.32-431.11.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13.
INFO 10:03:14,445 HelpFormatter - Date/Time: 2015/05/04 10:03:14
INFO 10:03:14,445 HelpFormatter - --------------------------------------------------------------------------------
INFO 10:03:14,445 HelpFormatter - --------------------------------------------------------------------------------
INFO 10:03:15,216 GenomeAnalysisEngine - Strictness is SILENT
INFO 10:03:15,308 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 10:03:15,439 GenomeAnalysisEngine - Preparing for traversal
INFO 10:03:15,440 GenomeAnalysisEngine - Done preparing for traversal
INFO 10:03:15,440 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 10:03:15,441 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 10:03:15,441 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 10:03:17,682 ProgressMeter - done 7942.0 2.0 s 4.7 m 99.9% 2.0 s 0.0 s
INFO 10:03:17,682 ProgressMeter - Total runtime 2.24 secs, 0.04 min, 0.00 hours
INFO 10:03:18,554 GATKRunReport - Uploaded run statistics report to AWS S3
i.e., you there is no error report. However, when I manually look at filtered_snps_test.vcf, I can see that many SNPs that have a MQ value below 30 (or 40, if that's what I've used) have still been passed in the filter.
I've also done a raw line-count of rawSnps.vcf and filtered_snps_test.vcf, and starting at the #CHROM position, both line counts are the same.
I'm not sure what's going on here. You can see both files here:
rawSnps.vcf
: https://www.dropbox.com/s/868kkpna5n315r0/filtered_snps_test.vcf?dl=0
filtered_snps_test.vcf
: https://www.dropbox.com/s/868kkpna5n315r0/filtered_snps_test.vcf?dl=0
Any help would be much appreciated.
Thanks!
MappingQualityRankSum is missing from your VCF header.
try removing this field from your test?