Hi! I'm kind of new to python and have some trouble using Pysam.
I'm using Python 2.7 on Debian 8.
I have a sample.bam file (90Mb) and a sample.bam.bai (created with samtools). I wan't to use it in order to experiment with pysam, but can't get it to work.
This is the code, as written in pysam intro:
samfile = pysam.AlignmentFile('sample.bam', 'rb')
for read in samfile.fetch('chr1', 100, 120):
print read
When I run this code, I get absolutely no output. No errors of anykind. I tried putting print 'helo'
instead, of print 'read'
, but still no output. So i tried this instead:
samfile = pysam.AlignmentFile('sample.bam', 'rb')
for read in samfile.fetch('chr1'):
print read
That is, without a given range, and all of the sudden I get an "endless" output of chr1.
So basically, it's the range part that doesn't work with the code. I've tried googling wihtout any luck. Does anyone know what's going on?
Have you checked (e.g. in IGV) if the range you specified (chr1:100-120) does contain reads at all?
No, but I used qualimap to check the content of the BAM file and saw that chromosome 1 is roughly 300 million in length. Shouldn't 100-120 exist in that case?
Or have I complete missunderstood the meaning of range in this context?
Range would be the coordinates on the chromosome. It's (almost telomeric), so depending on your dataset (WES?/RNA-seq?) it may or may not contain reads.
Okay, I just opened the bam file with IGV and the length of the chr1 is visibly around 250-300 million bases. Shouldn't it then contain the coordinates 100-120? Sorry, I don't know about what dataset this is, can I view that with IGV?
I downloaded the BAM file from: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwRepliSeq/ to exerpiment with.
Just zoom into that region. However, that will have no reads in hg19 (the telomeres are hardmasked).
Therefore, fetching probably retrieves nothing. I would consider this a bug in the documentation of pysam. Why not take more sensible example... A ubiquitous expressed gene or something...
I'm not sure where the documentation problem is. It's an iterator that will iterate over nothing. One could
samfile.count()
ahead of time.It's not convenient to display new users with an example that's going to return nothing as one of the first commands they encounter.
It's a fair point - but no matter what range you put in the docs theres a possibility the reads won't be there. The docs should just state that there has to be reads aligned to the region being fetched, else there will be no output.
Having said that, since the first (30,000?) bp are hard masked as Devon says, these numbers really should be higher...