I am getting different counts for the number of bases on reference covered by aligned reads using samtools depth/mpileup and BEDTools genomeCoverageBed commands. I am using samtools-0.1.19 and bedtools-2.17.0
If there are any spliced (CIGAR ~ /N/) alignments in your BAM file, bedtools will count the bases in the spiced region by default. To prevent this, use the -split option.
I think you should follow Heng's advice and look at discrepant positions with IGV or the like. I did this myself and found positions where bedtools reports coverage that is higher than samtools. When I look in IGV, the alignments support the bedtools counts, but that is an N=1 experiment. There may be cases where the opposite is true. I will try to look more closely at this soon.
Glad to hear that you have seen something similar. See my reply to Heng's comment below. I have seen this in about 80% of my mapping runs this week using tmap, bwa mem, bowtie2, bwasw and bwa aln. Interestingly, the counts were same for only bwa aln produced SAM files.
I did check the counts in Tablet/Bamview and it looks like samtools depth is ignoring some sites and under-counting others. Do I need to tweak the parameters? Thanks
This happened to me too, but after playing around with both I realised that it's because samtools mpileup counted in properly mapped only, while bedtools used all the mappings.
Solution: filter the .bam file first using samtools view -b -f 2 to get only mappings with properly mapped reads, then pass that output to bedtools.
You will get exactly the same results for both of the methods.
The discrepancy in my case seems to be due to secondary alignments. samtools view -b -F 0x100 whatever.bam | bedtools coverage -abam stdin ...
agrees with samtools depth. I want to count those secondary alignements though, so samtools view | bedtools coverage seems to be the most proper way to go.
ETA: samtools view -u ... | bedtools coverage ... appears to work fine too, despite what the docs implies
Based on my comments above, I looked more closely into this. I ran the exact commands you used on a test file I have. It is never the case that the samtools depth is reported to be higher than bedtools. When discrepant, bedtools is always higher.
Here is an example. In this case, bedtools reports a coverage of 5, whereas samtools reports coverage of 1. Every other case I check out has the same pattern. I actually doubt there is a bug in the samtools logic for counting depth. I suspect that instead, the parameters you are using aren't behaving exactly the way you expect, and as such, some alignments are just being filtered out before being counted.
Samtools skips secondary alignments and aberrant read pairs. It is possible to change the default in the code, but not possible at command line. Mpileup has -A, but it turns out to be useless (a bug). If my guess is correct, 4 of reads in your example do not have bit 2 flagged.
Yep, in that example 4 reads are duplicates. bedtools just counts whatever you have in your BAM file so long as it is mapped. Filtering duplicates, discordants, etc. is expected to be done ahead of time with samtools, Picard, or bamtools. I think this solves the discrepancy.
I have found exactly the same after exploring a few cases by eye. The coverage count was less for mpileup/depth even when the base count was equal to genomeCoverageBed. I will go with the genomeCoverageBed counts. Thank you for looking into it.
A guess: I know that samtools outputs only positions with depth > 0. Might it be that genomeCoverageBed instead outputs a line for every position in the reference? If your reference is 1,027,173 bases long then this is likely.
There might be a couple of things going on here. The first is that I think you are using an outdated version of BEDTools - I can't find genomeCoverageBed anywhere in the current git repository, and I think it has been superseded by coverage. In any case, you should take a look at the -split option in the documentation for coverage as I suspect that your reads have insertions and deletions, and that BEDTools is not correctly splitting the CIGAR string of these reads to produce coverage in the way that samtools does. This could result in a different number of bases in your reads "covering" the bases in the alignment, and a different number of lines in your output.
That's right, the genomeCoverageBed command is just a shell script that is built when you install bedtools. It merely calls the new tool "bedtools genomecov".
i have a question regarding the depth calculated by samtools, when there is a deletion in the read.
scaffold79 978 N 5 g$gggg HGHGH
scaffold79 979 N 4 aaaa HHGH
scaffold79 980 N 4 g-1ng-1ng-1ng-1n HH?H
scaffold79 981 N 4 **** HHEH
scaffold79 982 N 4 tttt HHEH
scaffold79 983 N 4 tttt HHGH
scaffold79 984 N 4 tttt HHGD
at position 980 it is indicated, that here is a deletion in 4 reads at the following position. So these 4 reads do not really cover the next position 981.
Should i check for these positions and correct the depth to gain the real coverage, or is this the accepted way to calculate the coverage?
You should start a new question for future queries like this, but the answer is "it depends". If you're looking for base-by-base depth, then yes, the coverage at that position would be 1. If you're calculating genomic coverage for sequencing metrics, you wouldn't penalize that position for having a deletion that was adequately covered by reads.
If there are any spliced (CIGAR ~ /N/) alignments in your BAM file, bedtools will count the bases in the spiced region by default. To prevent this, use the
-split
option.Thanks for the prompt response! Unfortunately, using -split did not change the counts.
I think you should follow Heng's advice and look at discrepant positions with IGV or the like. I did this myself and found positions where bedtools reports coverage that is higher than samtools. When I look in IGV, the alignments support the bedtools counts, but that is an N=1 experiment. There may be cases where the opposite is true. I will try to look more closely at this soon.
Glad to hear that you have seen something similar. See my reply to Heng's comment below. I have seen this in about 80% of my mapping runs this week using tmap, bwa mem, bowtie2, bwasw and bwa aln. Interestingly, the counts were same for only bwa aln produced SAM files.
You can extract sites unique to bedtools and check the alignment in a BAM viewer.
I did check the counts in Tablet/Bamview and it looks like samtools depth is ignoring some sites and under-counting others. Do I need to tweak the parameters? Thanks