Hello.
I've been reading posts and tutorials, but I still have doubts about how to choose a quality threshold for SNPs in VCF files.
Right now I'm using samtools for variant calling and bcftools to generate the VCF files. I'm working with sequencing data from Mycobacterium bovis, the bacterium that causes bovine tuberculosis.
I'm generating my VCF files with these two command lines:
samtools mpileup -g -f Mycobacterium_bovis_AFF2122_97/NC_002945.fna MB534_mapped_sorted.bam > MB534_variants.bcf
bcftools call -c -v MB534_variants.bcf > MB534_variants.vcf
OK, now that I have the VCF file, I'm wondering how to filter it.
The first thing that came to mind was to check the IDV values for insertions and deletions, and DP for mutations. These measures seem pretty intuitive: if a lot of reads support a call, it is easier to believe that the SNP is not a random event, right?
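To make the DP/IDV idea concrete, here is a minimal Python sketch of the kind of filter I have in mind. The cutoffs (`min_dp=10`, `min_idv=5`) are arbitrary placeholders, not recommended values, and the four records are made up for illustration; they are not from my MB534 data. The function names are my own, not part of samtools or bcftools.

```python
def parse_info(info_field):
    """Turn 'DP=35;IDV=12;INDEL' into {'DP': '35', 'IDV': '12', 'INDEL': True}."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
        elif item:
            info[item] = True  # flag without a value, e.g. INDEL
    return info


def passes_depth(vcf_line, min_dp=10, min_idv=5):
    """Keep indels with enough IDV support and other sites with enough DP.

    min_dp and min_idv are placeholder cutoffs (assumptions), not advice.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # column 8 of a VCF record is INFO
    if "INDEL" in info:
        return int(info.get("IDV", 0)) >= min_idv
    return int(info.get("DP", 0)) >= min_dp


# Made-up records (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).
example_lines = [
    "NC_002945\t1000\t.\tA\tG\t150\t.\tDP=40;MQ=60",
    "NC_002945\t2000\t.\tC\tA\t20\t.\tDP=4;MQ=30",
    "NC_002945\t3000\t.\tGT\tG\t80\t.\tINDEL;IDV=12;DP=30",
    "NC_002945\t4000\t.\tT\tTA\t15\t.\tINDEL;IDV=2;DP=25",
]

kept = [line for line in example_lines
        if not line.startswith("#") and passes_depth(line)]
kept_positions = [int(line.split("\t")[1]) for line in kept]
print(kept_positions)
```

With these placeholder cutoffs, only the records at positions 1000 and 3000 are kept.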
After some research on the Internet, I got particularly interested in this tutorial from Samtools.
Since I had another variant calling result for the same sample, I could use the second suggestion and compare it with my results to get a quality value. The ratio of transitions to transversions gave a similar quality threshold, just a little lower (I saw in some papers that the ratio should be around 2 to 2.1, and I found a paper on M. bovis that reports 2.06).
But there are so many other fields in the results, a lot of Mann-Whitney U tests and many other things. Where can I read about them, or find examples of how to use them to decide whether I should believe a SNP or not? The MQ and FQ values seem pretty interesting:
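In case it helps to see what I did with the transition/transversion ratio, this is a rough Python sketch of the idea: recompute the ratio over the SNPs that survive each candidate QUAL cutoff and see where it approaches the expected value. The records and cutoffs below are invented for illustration, and `ts_tv_ratio` is my own helper name, not a bcftools function.

```python
# A transition is a purine<->purine or pyrimidine<->pyrimidine change.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(records, min_qual):
    """Ts/Tv over biallelic SNPs with QUAL >= min_qual.

    records is a list of (REF, ALT, QUAL) tuples.
    Returns NaN when no transversion passes the cutoff.
    """
    ts = tv = 0
    for ref, alt, qual in records:
        if qual < min_qual:
            continue
        if len(ref) != 1 or len(alt) != 1:
            continue  # skip indels and multi-allelic ALT such as "A,G"
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("nan")


# Made-up (REF, ALT, QUAL) values, not from a real sample.
records = [
    ("A", "G", 200), ("C", "T", 180), ("G", "A", 150), ("T", "C", 90),
    ("A", "C", 170), ("G", "T", 60), ("C", "G", 25), ("A", "T", 15),
    ("T", "C", 30),
]

for cutoff in (0, 50, 100):
    print(cutoff, ts_tv_ratio(records, cutoff))
```

On this toy data the ratio is 1.25 with no cutoff, 2.0 at QUAL >= 50, and 3.0 at QUAL >= 100, so the cutoff of 50 would be the one closest to the ratio of about 2 that I mentioned.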
ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads"
ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same"
Nothing I have read touches on these fields, but they must be there for a reason. The header says (bigger is better) or (smaller is better), but I don't know which numbers count as big and which count as small.
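The only thing I could think of so far is to look at how a field such as MQ or FQ is distributed across all my calls, so that at least I can see what is big or small relative to the rest of the file. A minimal Python sketch of that, with invented INFO values and my own helper names (`info_values`, `summarize`); it does not tell me which cutoff is correct, only where a site sits among the others:

```python
import statistics


def info_values(vcf_lines, tag):
    """Collect the numeric values of one INFO tag (e.g. 'MQ') from VCF records."""
    values = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        info_field = line.rstrip("\n").split("\t")[7]
        for item in info_field.split(";"):
            if item.startswith(tag + "="):
                values.append(float(item.split("=", 1)[1]))
    return values


def summarize(values):
    """Minimum, median and maximum, to judge what is 'big' or 'small' here."""
    return {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Made-up records; the MQ and FQ values are placeholders, not real output.
example_lines = [
    "##fileformat=VCFv4.2",
    "NC_002945\t1000\t.\tA\tG\t150\t.\tDP=40;MQ=60;FQ=-282",
    "NC_002945\t2000\t.\tC\tA\t20\t.\tDP=4;MQ=20;FQ=12",
    "NC_002945\t3000\t.\tG\tA\t80\t.\tDP=30;MQ=58;FQ=-90",
    "NC_002945\t4000\t.\tT\tC\t45\t.\tDP=25;MQ=42;FQ=-30",
]

mq_summary = summarize(info_values(example_lines, "MQ"))
fq_summary = summarize(info_values(example_lines, "FQ"))
print(mq_summary)
print(fq_summary)
```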
So, my questions are:
- Could someone point me to a tutorial on how to decide a quality threshold for a bacterial genome?
- Will I run into trouble if I use only IDV, DP, and the transition/transversion ratio as criteria to control the quality of the VCF results?
- Are there magic numbers for the other variant calling criteria, or will it always depend?