I am very new to pysam, but I am trying to generate a table in which are listed all the nucleotides by genomic position divided by strand with python. I am working on amplicon data, so it would be possible that pysam would mark them as duplicates (as default). By now I simply re-wrote codes from a previous post (https://www.biostars.org/p/215192/#217314), but counts are not correct. In particular, I observed that only nucleotides observed in Read1 are reported (and are not fully correct), but counts from Read2 are almost absent, even if there are some positions with high counts (but presumably wrong). Anyone experienced such a strange behaviour? Below is reported the code for simplicity: import pysam
samfile = pysam.Samfile("reads.sorted.bam", "rb")
mystring = ''
mylist = []
for pileupcolumn in samfile.pileup("gi|251831106|ref|NC_012920.1|"):
mystring = ''
for pileupread in pileupcolumn.pileups:
if not pileupread.is_del and not pileupread.is_refskip:
mystring = mystring + pileupread.alignment.query_sequence[pileupread.query_position]
mystring = "".join(sorted(list(mystring))) # Sort alphabetically
mylist.append((pileupcolumn.pos+1,mystring))
samfile.close()
It worked like a charm. Thanks a lot, I didn't this section. Bests