I am currently working on influenza virus and Ebola virus. I have 45 virus samples, so I have 45 BAM files, each aligned to the influenza reference genome (influenza.fa). I called variants with UnifiedGenotyper as follows:
java -Xmx16g -Djava.io.tmpdir=$out_folder/tmp -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-nt 12 \
-dcov 10000 \
-glm BOTH \
-R influenza.fa \
-l INFO \
-o A_California_Influenza_Virus.raw.vcf \
--sample_ploidy 1 \
$INPUT_BAM_FILES
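Here $INPUT_BAM_FILES expands to one -I argument per BAM file. A minimal sketch of how it is built (bam_folder is a placeholder path, not the real one):

# Build "-I sample1.bam -I sample2.bam ..." for all 45 BAM files
INPUT_BAM_FILES=""
for bam in $bam_folder/*.bam; do
    INPUT_BAM_FILES="$INPUT_BAM_FILES -I $bam"
done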
This produced a single raw VCF file (A_California_Influenza_Virus.raw.vcf) containing all 45 samples, with 1400 VCF records in total.
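As a quick sanity check (assuming a standard single-file VCF), the record count can be confirmed by counting the non-header lines:

# Count VCF records (every line that does not start with "#")
grep -vc "^#" A_California_Influenza_Virus.raw.vcf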
As per the GATK Best Practices paper, hard filtering is recommended for small datasets, so that is what I applied. Is a call set of 1400 records small enough that hard filtering (rather than VQSR) is the right choice?
Then I extracted the SNPs into a separate VCF file:
java \
-jar /data1/software/gatk/current/GenomeAnalysisTK.jar \
-T SelectVariants \
-R A_California_Influenza_Virus_H1N1.fa \
-V A_California_Influenza_Virus.raw.vcf \
-selectType SNP \
-o VariantFiltering/A_California_Influenza_Virus.raw.snps.vcf
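Since the GATK hard-filtering recommendations use different thresholds for indels than for SNPs, I also planned to pull the indels into their own file. A sketch of that step (the indel output filename is my own choice):

# Indels get their own file because the recommended hard filters differ from the SNP filters
java \
-jar /data1/software/gatk/current/GenomeAnalysisTK.jar \
-T SelectVariants \
-R A_California_Influenza_Virus_H1N1.fa \
-V A_California_Influenza_Virus.raw.vcf \
-selectType INDEL \
-o VariantFiltering/A_California_Influenza_Virus.raw.indels.vcf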
Then I applied hard filtering to the SNPs:
java \
-jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R A_California_Influenza_Virus_H1N1.fa \
-V VariantFiltering/A_California_Influenza_Virus.raw.snps.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "myfilter1" \
-o VariantFiltering/A_California_Influenza_Virus.filtered.snps.vcf
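To see how many records were actually flagged, I tallied the values in the FILTER column (column 7 of the VCF body); this is just a shell one-liner, not part of the GATK pipeline:

# Tally FILTER column values (PASS vs. myfilter1) in the filtered VCF
grep -v "^#" VariantFiltering/A_California_Influenza_Virus.filtered.snps.vcf | cut -f7 | sort | uniq -c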
I understand that variants matching the above expression are flagged as low quality (the filter name is written into the FILTER column rather than the records being removed).
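If I later want a VCF containing only the passing variants, my plan is to run SelectVariants with --excludeFiltered. A sketch (the output filename is arbitrary):

# Keep only records whose FILTER column is PASS or unfiltered
java \
-jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R A_California_Influenza_Virus_H1N1.fa \
-V VariantFiltering/A_California_Influenza_Virus.filtered.snps.vcf \
--excludeFiltered \
-o VariantFiltering/A_California_Influenza_Virus.pass.snps.vcf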
What do the following mean:
- QD < 2.0
- FS > 60.0
- MQ < 40.0
- MQRankSum < -12.5
- ReadPosRankSum < -8.0
And what threshold values indicate high-confidence variants for QD, FS, MQ, MQRankSum, ReadPosRankSum, and DP?
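In case it is relevant: to help choose thresholds, my plan was to dump the annotation values with VariantsToTable and look at their distributions. A sketch, assuming GATK 3's VariantsToTable (--allowMissingData is there because annotations like MQRankSum are absent at some sites, and the output filename is my own):

# Dump per-site annotation values so their distributions can be inspected or plotted
java \
-jar GenomeAnalysisTK.jar \
-T VariantsToTable \
-R A_California_Influenza_Virus_H1N1.fa \
-V VariantFiltering/A_California_Influenza_Virus.raw.snps.vcf \
-F CHROM -F POS -F QD -F FS -F MQ -F MQRankSum -F ReadPosRankSum -F DP \
--allowMissingData \
-o VariantFiltering/A_California_Influenza_Virus.snps.annotations.table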