Question

Difference between Pysam Pileup and Fetch

1

Entering edit mode

5.8 years ago

aalith ▴ 20

I've been attempting to use pysam to find how many reads support a particular SNP vs how many reads do not. I have not been able to exactly pinpoint what pysam pileup is doing. Is it exactly like the samtools function?

For example, to compare what exactly is focused on for each function, I've run this...

for reads in bam.fetch(chrom, start = pos, stop = pos + 1):
    fetch_reads.append(reads.cigarstring)

for refsection in bam.pileup(chrom, start = pos, stop = pos + 1):
    for reads in refsection.pileups:
        pileup_reads.append(reads.alignment.cigarstring)
        if not reads.is_refskip:
            pileup_reads_norefskip.append(reads.alignment.cigarstring)

print(len(fetch_reads))
print(len(pileup_reads))
print(len(pileup_reads_norefskip))

And I get wildly different numbers for what I think are the number of reads at the particular site I'm interested in. 81,157492, and 4856 respectively. Also, when I try to append get_num_aligned(reads) to a list (supposed to return the number of aligned bases at a certain pileup column position), pysam gives me back a list of lists.

pysam pileup reads bam • 3.1k views

ADD COMMENT • link updated 5.6 years ago by castaway1990 • 0 • written 5.8 years ago by aalith ▴ 20

score 0 · Answer 1 · 2019-12-01

I had many troubles with those stuff too. If i remember correctly if you go for the "fetch" way after calling bam.fetch you also have to check if the outputted reads actually overlap the genomics position you searched for (which is not obvious if you deal with spliced reads maybe) since fetch will just report the inferred overlap looking at read start + read end, it does not guarantee you are really reading the desired base.

You can try something like this to see if the situation gets better:

import pysam
fetch_reads=[]
for read in bam.fetch(str(chr), int(position)-1, int(position)):
    if int(position)-1 in read.get_reference_positions():
        fetch_reads.append(read)

To better understand:

import pysam
position = yourPosition
chr = yourChr
bam=pysam.AlignmentFile(yourBam, "rb")


readnamesA=[]
for read in bam.fetch(str(chr), int(position)-1, int(position)):
    readnamesA.append(read.qname)

readnamesB=[]
for read in bam.fetch(str(chr), int(position)-1, int(position)):
    if int(position)-1 in read.get_reference_positions():
        readnamesB.append(read.qname)

testRead=list(set(readnamesA).difference(set(readnamesB)))[1]

for read in bam.fetch(str(chr), int(position)-1, int(position)): 
    if read.qname == testRead:
        print(read.positions)

You will probably see printed gapped genomics positions of the actually covered bases, among which you won't see your original fetched position