Question

Parsing Base String From Samtools Mpileup

1

Entering edit mode

11.4 years ago

Noushin N ▴ 600

Hello everyone,

I am trying to construct B-allele frequency plots from NGS data; and for that I am using samtools mpileup output at positions of interest.

Unfortunately, I am not able to establish a one-to-one relationship between the bases in base string column [column 4], and the base quality string [column 5] for lines with indels.

Here is an example from samtools sourceforge page:

seq2 156 A 11  .$......+2AG.+2AG.+2AGGG    <975;:<<<<<

According to columns 4 and 6, there are 11 reads aligned to this position. But when trying to parse the base string column, I get

9 reads aligned to reference base (A)
3 reads with an insertion of AG
2 reads with base G

Can anyone please explain what is going on?

Thank you in advance!

samtools mpileup • 7.5k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 11.4 years ago by Noushin N ▴ 600

2

Entering edit mode

The only possible explanation I can think of is the fact that ".+2AG" accounts for an insertion of "AG" bases, as opposed to "+2AG". This way, the numbers match up.

.|.|.|.|.|.|.+2AG|.+2AG|.+2AG|G|G

ADD REPLY • link 11.4 years ago by Noushin N ▴ 600

1

Entering edit mode

I think you are correct. I've always interpreted insertions as something that's between the current base and the next base. So that means there is still a match to the current base.

ADD REPLY • link 11.4 years ago by Damian Kao 16k

0

Entering edit mode

Thank you for the information and the pointer to your script. I have also come across the following example in one of my mpileups:

chr3    128988392       G       110
c$CCCCCC.CCcccC,Cc.,C,C...c.C.C...cccCcCcCC..cC..,...CCCCCcccC.CCC.C,CCCC,CCcCccC.cC,C,c.CccCcC,CCc.CCcCC,+1ccc+1c.c^],

Here, there is an instance of ",+1c" and one of "+1c". What is your interpretation for each of them?

Thanks again

ADD REPLY • link 11.4 years ago by Noushin N ▴ 600

1

Entering edit mode

",+1c" is a match on the opposite strand and a single insertion of c between the current base and next base. "+1c" should probably be "c+1c" meaning there is a read with a mismatch c in the current position and a single insertion of c between the current base and next base.

ADD REPLY • link 11.4 years ago by Damian Kao 16k

2

Entering edit mode

I wrote a script for this a while back. I am not sure if it works 100% correctly though. You should definitely spot check it: http://blog.nextgenetics.net/?e=56

ADD REPLY • link 11.4 years ago by Damian Kao 16k

Ram · Answer 1 · 2014-03-21

0

Entering edit mode

10.7 years ago

Noushin N ▴ 600

Since no completely correct answer was posted to this question, I would like to point out that Damian's script looks pretty functional. After spot checking, it looks to me that it does not catch the ',' or '.' bases preceding an insertion/deletion. Thus, it will overestimate the number of reference bases at such sites.

I wrote a quick parser, which corrects the issue above and is accessible at: https://github.com/noushin6/NGSTools/blob/master/baseParser.py

I have spot checked it quite a bit; but feel free to do so yourself and use at your own risk.

ADD COMMENT • link 10.7 years ago by Noushin N ▴ 600

0

Entering edit mode

@ Noushin N Thanks a lot

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by konstantinkul ▴ 110

score 0 · Answer 2 · 2016-07-12

The "+2AGGG" in the last part of the pileup string should be interpreted as "+2AG", "G", "G". The "+2" indicates that only the following two nucleotides are part of the insertion. The last 2 "G" characters indicate two variant "G" nucleotides and are not part of the insertion string. If you view it this way, everything matches up.

Edit: Noticed this has been addressed in the comments section by author of original post