Hello -
I was wondering if anyone else has noticed this wierd phenomena using the Samtools mpileup command:
I am running mpileup on a small test BAM file, looking at chr 3, position 72880191. (The test BAM file is linked publicly here: https://dl.dropbox.com/u/11626982/test.bam.)
In one scenario, I am using the -l option for a list of variants in BED format, of which I have a list of only one:
~$ cat one_site.bed
3 72880190 72880191 A G
~$ samtools mpileup -f all_sequences.fa -l one_site.bed test.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
3 72880191 A 171 .$.$..,.............GG.............,,g.GG.G.G,gG.........,gGg...G.G.G.G...G.,.,.,.....,.,,......,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,gg,gggg,gggggg,gg,gg,,.,,,,G,,GggggG,G,,gGg^],^],^]g^]g DDDDF@DDBDDBDDB>>@DBDDDDDB>DBBDBBHH9DDDBDD?HGDDD>DDD<D5FDE@DBDGDABB<EDADGCFCDBFD9DD#F#DJDDD?BBDJHDJHJDJDJDDBEDDIDDJDCDDCDFDB;;D9:>AD9;9;>3D?(@>3'D?D>@DDD>#>A8@?0@D<9<CDD#C
(Please note that all_sequences.fa here is the NCBI-human-build36 reference fasta.)
In a second scenario, I am using the -r option for listing the position of interest:
~$ samtools mpileup -f all_sequences.fa -r 3:72880191-72880191 test.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
3 72880191 A 171 .$.$..,.............GG.............,,g.GG.G.G,gG.........,gGg...G.G.G.G...G.,.,.,.....,.,,......,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,gg,gggg,gggggg,gg,gg,,.,,,,G,,GggggG,G,,gGg^],^],^]g^]g $111%1.10!11!,1'!3!!!26-/!*)$(-$3)!!8!!!!,!!"!'!'%-+!&$!#!#$,$!!#$!!!-#,!!%+'$0)!!!#%!6:!!!!(!7-?8,;74.6(538!33'57-5-530162-%)3)(%!3*!-'$)3#(-!)'313733!33#%!*#%0$39)'!44#!
What I am trying to figure out is - Why are the quality strings so different when everything else in the output of both scenarios is identical?
Thank you for any help!
Update - Found this on SeqAnswers:
"it looks like the issue is that if I don't specify the -r flag, samtools does not use BAQ computation."
Is there any way to make sure BAQ computation remains ON when using the -l flag?
Thanks!
Hey Cyriac, I've heard before that BAQ artifacts are possible at the end of bed-regions (that's expected I guess), however, this is the first time I hear BAQ is completely disabled when using a bed file as input. Do you remember where you read that 'BAQ computation was incorrectly disabled for both -r and -l' and that someone 'fixed the problem for -r in later versions'. I'm also affected by this problem and can't actually find the cause in samtools' source code.
I've edited my answer above to clarify that those were presumptions based on my testing with different samtools versions. The "fix for -r in later versions" could have been an unintentional change.
But you make a point that I wasn't aware of - that artifacts are possible at the end of bed-regions. Is this because BAQ is based on the quality of the surrounding alignments? But it seems to work correctly for -r, so there is no reason for it to behave differently with -l. I tested out this theory by adding a few flanking bps around 72880190-72880191 in the OP's BED file, and BAQ computation was properly calculated for the single site in question. So it must be some kind of exception at the ends of BED regions, like you said.