I'm writing a new version of a tool and I'm trying to implement some functionality using pysam
that was previously implemented using samtools bedcov
.
For example, here is a sample output of samtools bedcov test.bed tcga_test.bam -Q 50
:
chr19 50858094 50858095 64004
chr19 50858128 50858129 63170
chr19 50858162 50858163 51889
To calculate the coverage I've been using the pysam
function AlignmentFile.count
in the following way:
import pysam
def read_check(read):
if read.flag & (0x4 | 0x100 | 0x200 | 0x400): # FUNMAP,FSECONDARY,FQCFAIL,FDUP
return False
elif read.mapping_quality <= 50:
return False
else:
return True
bamfile = pysam.AlignmentFile('tcga_test.bam')
bamfile.count('chr19', start = 50858094, stop = 50858095, read_callback = read_check)
bamfile.count('chr19', start = 50858128, stop = 50858129, read_callback = read_check)
bamfile.count('chr19', start = 50858162, stop = 50858163, read_callback = read_check)
From this I get
68361
63170
51889
samtools bedcov
is filtering the first region in some way I don't see. The difference here seems to be an edge case--I only run into it one or two times processing large BAM files, approx 1% of cases I test. I originally thought the difference may be due to filtering of overlapping read pairs. The code I used to test this was:
read_set = set()
for read in bamfile.fetch('chr19', start, start + 1):
read_set.add(read.query_name) if read_check(read) else None
len(read_set)
But I get:
67676
62374
50940
So clearly samtools bedcov
is not doing this type of filtering, though maybe my code doesn't work. From what I understand, samtools bedcov
uses samtools mpileup
to count the reads. I tried to test all the ways mpileup
could filter reads but I haven't come up with anything that explains the difference in the small number of cases I come across.