Hi guys,
I'm trying to create a program that can count the number of reads in a BAM file that overlap a specific genomic coordinate(for example, chr15:28365618) and which nucleotides are present in those reads(with a minimum base quality of 30). This is my program so far:
import pysam
bamfile=pysam.AlignmentFile("file1.bam","rb")
for read in bamfile.fetch("chr15",28365618,28365619)
a=0
c=0
g=0
t=0
n=0
pos=read.reference.start
nucl=read.query_sequence[28365618-pos]
for i in nucl
if i=="A":
a+=1
elif i=="C":
c+=1
elif i=="G":
g+=1
elif i=="T":
t+=1
elif i=="N":
n+=1
print a+c+g+t+n #total no. of reads
print "A;",a
print "C:" c
print "G",g
print "T",t
print "N",n
However,my output is way off when I compare it to IGV, and I can't quite figure out why.Maybe because of duplicate reads? Any help will be greatly appreciated.
BTW, I corrected the formatting and added the missing
:
afterif i=="T"
.Hi You can user Counter from the collections module to do the counting.
from collections import Counter
counts = Counter(sequence)
Here is a one liner which should work:
all_bases = [Counter([aln.seq[i] for i in range(len(aln.seq)) if aln.qual[i] > 30]) for aln in bamfile.fetch('chr5', 1234, 12345)]
OK, is it working? Is it not? Are you asking for help with something or just showing what you can do?
Not quite, the output is way off and I'm not sure what's wrong with the code. Also, to find the base quality I used the following code:
[ord(c)-33 < _30 for c in list(read.qual)]):
However, it doesn't seem to have any effect .
Something like the following would seem to do what you want:
This seems to be working a lot better ,thanks so much. Just a quick query: Is there a way of knowing how many of these reads are mapped to the forward strand? I tried using
if read.is_Reverse
, but it doesn't seem to do the trick. Also,I'm trying to remove duplicate reads as well.Again, thanks a lot.
if read.is_reverse
, spelling and capitalization are rather important in programming. If duplicates have been marked, thenif read.is_duplicate
.in your
for
loop you haven't definedn
. and I guess you need to change threeif
s toelif
!Yeah, edited that. Still, the program doesn't seem to give the correct output.