Hi all, bioinformatics noob here.
I have been trying to call variants and don't fully understand what the I16 tag's first 4 entries mean. I found this table:
1 #reference Q13 bases on the forward strand 2 #reference Q13 bases on the reverse strand
3 #non-ref Q13 bases on the forward strand 4 #non-ref Q13 bases on the reverse strand
5 sum of reference base qualities 6 sum of squares of reference base qualities
7 sum of non-ref base qualities 8 sum of squares of non-ref base qualities
9 sum of ref mapping qualities 10 sum of squares of ref mapping qualities
11 sum of non-ref mapping qualities 12 sum of squares of non-ref mapping qualities
13 sum of tail distance for ref bases 14 sum of squares of tail distance for ref bases
15 sum of tail distance for non-ref bases 16 sum of squares of tail distance for non-ref
And learnt that Q13 means a base quality of bigger than 13 Phred Score.
But I dont understand why sometimes I am getting variants that seem to be in the entry 2 rather than entry 3. Can someone give an example of when I should expect to see an entry in 2 (reference bases on the reverse strand) rather than in 3 (no-ref bases in the forward strand) or in 4 (no-ref bases in the reverse strand). I have been trying to see how the results match up to my sam file but can't find how they are consistent.
Bonus question:
I am trying to extract the number of 'mutant' paired reads observed per reference genome. Eg. I have a genome with 40 pairs that have no snps,indels etc., 3 pairs that have one indel and 10 pairs that have one snp. Then I want to know that this genome had 13 'mutant' pairs that were read.
In order for me to do this it seems I need 2 things. 1. To be able to first get the number of paired snps/indels from my sam file. bcftools call
and bcftools view
dont seem to be doing this because of the I16 outputs I dont understand.
2. A way to extract this information. bcftools stats
seems to aggregate snps and indels, etc rather than keeping them separated out by reference genome.