Hi,
I was wondering if there is any data set that I can use as ground truth to compare which tool behaves better at counting the number of reads within a BAM file for instance. So far, I've ran bcftools mpileup
which is supposedly only a wrapper for samtools mpileup
(which I also ran) and a third tool called bam-readcount
(https://github.com/genome/bam-readcount).
I ran all three tools on the same data set making sure that the quality thresholds/filtering parameters are the same for all three tools. However, when I look at the results and compare the read counts for the same chromosome position (i.e., unique ID based on chr_pos_base
) I see that the counts reported highly vary.
I've attached some boxplots here which show the absolute difference in read counts per reported position for each pair of tools tested. BRC
is bam-readcount
, BCF
is bcftools mpileup
, and SAM
is samtools mpileup
. Performed the analysis on different tissues from the same sample (X-axis). What is intriguing from those plots, is that samtools mpileup
vs. bcftools mpileup
are not agreeing as much as expected. Which is weird, because the bcftools
implementation of mpileup
should be just a wrapper according to this. For instance, the variance in the absolute difference in counts between samtools mpileup
vs. bam-readcount
is a lot smaller.
Any pointers how I can assess which tool is closest to the "truth" are appreciated.
Are you accounting for secondary alignments of reads?
You can load the BAM file read directly in IGV and count them yourself - which in lower coverage areas is very feasible.
Thanks. The problem does not actually rely in the areas with low coverage from what I've seen so far in IGV
@GenoMax To be honest, I am not sure if that is in the default parameter settings of
mpileup
. This are the commands I run forbcftools mpileup
andsamtools mpileup
:So nothing fancy there :|
You use option -B in samtools mpileup, but not in bcftools. Could explain some of the differences.
Good point. I'll run a quick test on only one chromosome and see if/how much that changes the results
A big difference between mpileup and bam-readcount is the default maximum depth reported (250 vs 10000000). There are other differences in the default parameters, with samtools filtering out more reads than bam-readcount in general.
Yes, that I agree. But I would expect
samtools mpileup
andbcftools mpileup
to produce very similar results with the same parameter settings. However, as seen in the plots, actuallybam-readcount
andsamtools
are the closest in read count reportingSorry for a late answer, but I've reached my 5 posts limit and had to wait 6 hours on a Friday evening :D
As Carlo Yague mentioned, I was running
bcftools
without the-B
flag set, as opposed to thesamtools
command. Setting that flag inbcftools
resulted in having identical results forsamtools mpileup
andbcftools mpileup
.However, there are still differences between
bam-readcount
andsamtools mpileup
, even if I match the maximum depth:So I still do not know which tool I should trust. Any further suggestions? Thank you
You can trust both, the difference comes from the fact that samtools/bcftools' defaults do some more read filtering than bam-readcount : discard low quality reads, pairs not properly paired, etc...
bam-readcount is closer to the "naïve" read coverage because by default, it does not filter anything.
That is true. I am now performing some other tests where I am setting the map/base quality to 30 and tweak the read depth. Thank you for your input and for spotting that flag there :]
So I guess the answer is: up to date, there is no ground truth data set for BAM read counts.