Hello everyone,
I am trying to construct B-allele frequency plots from NGS data; and for that I am using samtools mpileup output at positions of interest.
Unfortunately, I am not able to establish a one-to-one relationship between the bases in base string column [column 4], and the base quality string [column 5] for lines with indels.
Here is an example from samtools sourceforge page:
seq2 156 A 11 .$......+2AG.+2AG.+2AGGG <975;:<<<<<
According to columns 4 and 6, there are 11 reads aligned to this position. But when trying to parse the base string column, I get
- 9 reads aligned to reference base (A)
- 3 reads with an insertion of AG
- 2 reads with base G
Can anyone please explain what is going on?
Thank you in advance!
The only possible explanation I can think of is the fact that ".+2AG" accounts for an insertion of "AG" bases, as opposed to "+2AG". This way, the numbers match up.
.|.|.|.|.|.|.+2AG|.+2AG|.+2AG|G|G
I think you are correct. I've always interpreted insertions as something that's between the current base and the next base. So that means there is still a match to the current base.
Thank you for the information and the pointer to your script. I have also come across the following example in one of my mpileups:
Here, there is an instance of ",+1c" and one of "+1c". What is your interpretation for each of them?
Thanks again
",+1c" is a match on the opposite strand and a single insertion of c between the current base and next base. "+1c" should probably be "c+1c" meaning there is a read with a mismatch c in the current position and a single insertion of c between the current base and next base.
I wrote a script for this a while back. I am not sure if it works 100% correctly though. You should definitely spot check it: http://blog.nextgenetics.net/?e=56