Question

I16 Info Tag

6.0 years ago

tbb21 ▴ 10

Hi all, bioinformatics noob here.

I have been trying to call variants and don't fully understand what the I16 tag's first 4 entries mean. I found this table:

1   number of reference Q13 bases on the forward strand
2   number of reference Q13 bases on the reverse strand
3   number of non-ref Q13 bases on the forward strand
4   number of non-ref Q13 bases on the reverse strand
5   sum of reference base qualities
6   sum of squares of reference base qualities
7   sum of non-ref base qualities
8   sum of squares of non-ref base qualities
9   sum of ref mapping qualities
10  sum of squares of ref mapping qualities
11  sum of non-ref mapping qualities
12  sum of squares of non-ref mapping qualities
13  sum of tail distance for ref bases
14  sum of squares of tail distance for ref bases
15  sum of tail distance for non-ref bases
16  sum of squares of tail distance for non-ref bases

I also learnt that Q13 means a Phred base quality of 13 or higher.

But I don't understand why I sometimes get variants that seem to be counted in entry 2 rather than entry 3. Can someone give an example of when I should expect a count in entry 2 (reference bases on the reverse strand) rather than in entry 3 (non-ref bases on the forward strand) or entry 4 (non-ref bases on the reverse strand)? I have been trying to match the results against my SAM file but can't see how they are consistent.

Bonus question: I am trying to extract the number of 'mutant' read pairs observed per reference genome. E.g. I have a genome with 40 pairs that have no SNPs, indels, etc., 3 pairs that have one indel and 10 pairs that have one SNP. I then want to know that 13 'mutant' pairs were read for this genome. To do this it seems I need two things: (1) A way to get the number of pairs with SNPs/indels from my SAM file. bcftools call and bcftools view don't seem to do this, because of the I16 outputs I don't understand. (2) A way to extract this information. bcftools stats seems to aggregate SNPs, indels, etc. rather than keeping them separated by reference genome.

SNP snp sequencing bcftools mpileup • 3.1k views

score 2 · Accepted Answer · 2018-12-09

Hello tbb21,

The output you are talking about is the output of bcftools mpileup, and this is only halfway to your variants. In this step bcftools collects metrics for each covered position; bcftools call then uses these metrics to decide whether to take a closer look at the position for a variant.

So if you have values in the first and second fields of the I16 tag but not in the third or fourth, then no reads support a variant at this position, and bcftools call will not take a closer look at it.
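To make this concrete, here is a minimal Python sketch that splits an I16 value into the 16 named entries from the table in the question and checks whether entries 3 and 4 show any non-ref support. The field names, the helper functions and the example INFO string are made up for illustration; they are not part of bcftools.

```python
# Hypothetical names for the 16 I16 entries, in the order of the table above.
I16_FIELDS = [
    "ref_fwd", "ref_rev", "alt_fwd", "alt_rev",
    "ref_bq_sum", "ref_bq_sumsq", "alt_bq_sum", "alt_bq_sumsq",
    "ref_mq_sum", "ref_mq_sumsq", "alt_mq_sum", "alt_mq_sumsq",
    "ref_tail_sum", "ref_tail_sumsq", "alt_tail_sum", "alt_tail_sumsq",
]


def parse_i16(info):
    """Return the I16 entries of a VCF INFO string as a dict of floats."""
    for item in info.split(";"):
        if item.startswith("I16="):
            values = [float(v) for v in item[len("I16="):].split(",")]
            if len(values) != len(I16_FIELDS):
                raise ValueError("expected 16 values in I16")
            return dict(zip(I16_FIELDS, values))
    raise ValueError("no I16 tag in INFO field")


def has_alt_support(i16):
    """True if any Q13 base on either strand does not match the reference."""
    return i16["alt_fwd"] + i16["alt_rev"] > 0


# Made-up INFO value: 9 forward and 6 reverse reference bases, no non-ref bases.
example_info = "DP=15;I16=9,6,0,0,540,19800,0,0,900,54000,0,0,300,6500,0,0;QS=1,0"
i16 = parse_i16(example_info)
print(i16["ref_fwd"], i16["ref_rev"], has_alt_support(i16))
```

A position like the one in this example only has counts in entries 1 and 2, so all reads agree with the reference there.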

The whole way from your alignment file to your variant file has to look like this:

bcftools mpileup -Ou -f ref.fa aln.bam | bcftools call -Ov -mv > output.vcf
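For the bonus question, one possible starting point is to count the called records per reference sequence yourself. The sketch below is only an assumption about what you need: it reads the lines of a VCF such as output.vcf, groups records by the CHROM column, and separates SNPs from indels using the INDEL flag that bcftools puts in the INFO column. The function name is hypothetical, and note that this counts variant sites, not read pairs.

```python
from collections import Counter, defaultdict


def count_variants_per_contig(vcf_lines):
    """Count SNP and indel records per CHROM from an iterable of VCF lines."""
    counts = defaultdict(Counter)
    for line in vcf_lines:
        # Skip blank lines and the "##" / "#CHROM" header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, info = fields[0], fields[7]
        # bcftools marks indel records with an INDEL flag in INFO.
        kind = "indel" if "INDEL" in info.split(";") else "snp"
        counts[chrom][kind] += 1
    return counts


# Usage with a real file (not run here):
# with open("output.vcf") as handle:
#     for chrom, c in count_variants_per_contig(handle).items():
#         print(chrom, c["snp"], c["indel"])
```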

fin swimmer