Question

pysam doesn't calculate correct nucleotide quantities from .bam file

1

Entering edit mode

4.8 years ago

Gama313 ▴ 130

I am very new to pysam, but I am trying to generate a table in which are listed all the nucleotides by genomic position divided by strand with python. I am working on amplicon data, so it would be possible that pysam would mark them as duplicates (as default). By now I simply re-wrote codes from a previous post (https://www.biostars.org/p/215192/#217314), but counts are not correct. In particular, I observed that only nucleotides observed in Read1 are reported (and are not fully correct), but counts from Read2 are almost absent, even if there are some positions with high counts (but presumably wrong). Anyone experienced such a strange behaviour? Below is reported the code for simplicity: import pysam

samfile = pysam.Samfile("reads.sorted.bam", "rb")
mystring = ''
mylist = []
for pileupcolumn in samfile.pileup("gi|251831106|ref|NC_012920.1|"):        
    mystring = ''
    for pileupread in pileupcolumn.pileups:
        if not pileupread.is_del and not pileupread.is_refskip:                
            mystring = mystring + pileupread.alignment.query_sequence[pileupread.query_position]
            mystring = "".join(sorted(list(mystring)))  # Sort alphabetically
            mylist.append((pileupcolumn.pos+1,mystring))
samfile.close()

pysam count python • 1.5k views

ADD COMMENT • link updated 4.8 years ago by dariober 15k • written 4.8 years ago by Gama313 ▴ 130

score 0 · Answer 1 · 2020-09-25

0

Entering edit mode

4.8 years ago

dariober 15k

Some random thoughts looking at the documentation for pysam/pileup:

max_depth: default to 8000, maybe you need to increase this limit
ignore_overlaps: the default is True and it may explain why only read 1 is counted
flag_filter: Maybe you want to disable this filter and count all reads