Hi folks,
I'm trying to use pysam's pileup in order to count some base changes occurring at certain points in my transcript of interest using MinION data. However my coverage at each base is often three times as large as the total of reads in my final counts and I'm trying to figure out this discrepancy.
My code is the following:
def stats_4605(input_file):
A = 0
T = 0
C = 0
G = 0
total = 0
samfile = pysam.AlignmentFile(input_file, "rb")
for pileupcolumn in samfile.pileup("ref", 4604, 4605):
if pileupcolumn.pos != 4604:
continue
else:
print(input_file, "\nCoverage at base %s = %s" %
(pileupcolumn.pos, pileupcolumn.n))
for pileupread in pileupcolumn.pileups:
total += 1
if not pileupread.is_del and not pileupread.is_refskip:
if pileupread.alignment.query_sequence[pileupread.query_position] == "A":
A += 1
elif pileupread.alignment.query_sequence[pileupread.query_position] == "T":
T += 1
elif pileupread.alignment.query_sequence[pileupread.query_position] == "C":
C += 1
elif pileupread.alignment.query_sequence[pileupread.query_position] == "G":
G += 1
print("Total counted ", total)
print("A: ", A, "T: ", T, "C: ", C, "G: ", G)
print("T:A ratio = ", T/A)
print('')
samfile.close()
An example of the output I get is:
Coverage at base 4604 = 11547
Total counted 4502
A: 3063 T: 577 C: 4 G: 301
T:A ratio = 0.18837740777015996
So far I think the problem is with the line if not pileupread.is_del and not pileupread.is_refskip:
Deleting pileup.is refskip shows no changes, but pileupread.is_del gives an error message:
Traceback (most recent call last):
File "/home/user/PycharmProjects/samp_comp_extractor/4605_a_vs_t.py", line 36, in <module>
stats_4605("6hr_mapped_sorted.bam")
File "/home/user/PycharmProjects/samp_comp_extractor/4605_a_vs_t.py", line 20, in stats_4605
if pileupread.alignment.query_sequence[pileupread.query_position] == "A":
TypeError: string indices must be integers
Any idea if these are related or how I could get around these issues? I've tried generating a VCF for these reads but can't get it to produce an AF segment and would prefer more in depth numbers for my use case.
Cheers!
The problem with the code is that
pileupread.query_position
is not an integer. Not sure why.Oh.... Because if the base is deleted from the "query" (i.e. the read), then it has no position in the query_sequence.
Eyeballing it in a genomeviewer there are some deleted bases but looks do be much closer to 1 in 10 reads rather than 5-6/10 reads which my script is returning. I'm not sure if its a nanopore issue since it is only single reads and does have what looks to be many deletions but not enough over my region of interest to cause this drop in reads
Yeah, sorry, my comment was over what was causing the error from your code, not what was causing the underlying difference in read counts.
I noticed from the pysam documentation that
n()
includes bases that don't pass the quality filter. Perhapspileups
excludes these? What doesget_num_aligned()
return?It gives me the same number that my
print("Total counted ", total)
line gives me. But in the docs it says :This makes me think its a quality score issue, I'll have a dig around seeing if its possibly to set that cutoff lower but if you have any advice there that would be great!
I've found
set_min_base_quality(self, min_base_quality)
and added that to my code on line 15 just after the else which now works perfectly. Thanks for helping point me in the right direction!