Question

Pysam Fetching specified regions

0

Entering edit mode

7.9 years ago

AlchemistMoz ▴ 40

Hi! I'm kind of new to python and have some trouble using Pysam.

I'm using Python 2.7 on Debian 8.

I have a sample.bam file (90Mb) and a sample.bam.bai (created with samtools). I wan't to use it in order to experiment with pysam, but can't get it to work.

This is the code, as written in pysam intro:

    samfile = pysam.AlignmentFile('sample.bam', 'rb')
    for read in samfile.fetch('chr1', 100, 120):
        print read

When I run this code, I get absolutely no output. No errors of anykind. I tried putting print 'helo' instead, of print 'read', but still no output. So i tried this instead:

    samfile = pysam.AlignmentFile('sample.bam', 'rb')
    for read in samfile.fetch('chr1'):
        print read

That is, without a given range, and all of the sudden I get an "endless" output of chr1.

So basically, it's the range part that doesn't work with the code. I've tried googling wihtout any luck. Does anyone know what's going on?

pysam python • 13k views

ADD COMMENT • link 7.9 years ago by AlchemistMoz ▴ 40

0

Entering edit mode

Have you checked (e.g. in IGV) if the range you specified (chr1:100-120) does contain reads at all?

ADD REPLY • link 7.9 years ago by WouterDeCoster 47k

0

Entering edit mode

No, but I used qualimap to check the content of the BAM file and saw that chromosome 1 is roughly 300 million in length. Shouldn't 100-120 exist in that case?

Or have I complete missunderstood the meaning of range in this context?

ADD REPLY • link 7.9 years ago by AlchemistMoz ▴ 40

0

Entering edit mode

Range would be the coordinates on the chromosome. It's (almost telomeric), so depending on your dataset (WES?/RNA-seq?) it may or may not contain reads.

ADD REPLY • link 7.9 years ago by WouterDeCoster 47k

0

Entering edit mode

Okay, I just opened the bam file with IGV and the length of the chr1 is visibly around 250-300 million bases. Shouldn't it then contain the coordinates 100-120? Sorry, I don't know about what dataset this is, can I view that with IGV?

I downloaded the BAM file from: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwRepliSeq/ to exerpiment with.

ADD REPLY • link 7.9 years ago by AlchemistMoz ▴ 40

2

Entering edit mode

Just zoom into that region. However, that will have no reads in hg19 (the telomeres are hardmasked).

ADD REPLY • link 7.9 years ago by Devon Ryan 105k

0

Entering edit mode

Therefore, fetching probably retrieves nothing. I would consider this a bug in the documentation of pysam. Why not take more sensible example... A ubiquitous expressed gene or something...

ADD REPLY • link 7.9 years ago by WouterDeCoster 47k

0

Entering edit mode

I'm not sure where the documentation problem is. It's an iterator that will iterate over nothing. One could samfile.count() ahead of time.

ADD REPLY • link 7.9 years ago by Devon Ryan 105k

1

Entering edit mode

It's not convenient to display new users with an example that's going to return nothing as one of the first commands they encounter.

ADD REPLY • link 7.9 years ago by WouterDeCoster 47k

0

Entering edit mode

It's a fair point - but no matter what range you put in the docs theres a possibility the reads won't be there. The docs should just state that there has to be reads aligned to the region being fetched, else there will be no output.

Having said that, since the first (30,000?) bp are hard masked as Devon says, these numbers really should be higher...

ADD REPLY • link 7.9 years ago by John 13k

score 0 · Answer 1 · 2017-02-10

0

Entering edit mode

7.9 years ago

Devon Ryan 105k

Your original code should have been:

for read in samfile.fetch('chr1', 100, 120):

ADD COMMENT • link 7.9 years ago by Devon Ryan 105k

0

Entering edit mode

Sorry, I don't see the difference? In the first code of my post I also wrote:

for read in samfile.fetch('chr1', 100, 120):

Am I missing something?

ADD REPLY • link 7.9 years ago by AlchemistMoz ▴ 40

0

Entering edit mode

That's not what you wrote in your post, you wrote 'chr1, 100, 120' as a single string, which would be a non-existent chromosome name.

ADD REPLY • link 7.9 years ago by Devon Ryan 105k

0

Entering edit mode

oh sorry for that, that was just a typo in this post. The original code was correctly written ('chr1', 100, 120):

ADD REPLY • link 7.9 years ago by AlchemistMoz ▴ 40

score 0 · Answer 2 · 2017-02-13

0

Entering edit mode

7.9 years ago

AlchemistMoz ▴ 40

I managed to solve the probelm in the end by simply using another BAM file. I'm new to this, so didn't 100% understand all of your explanations as to why it didn't work with that specific BAM file. I'll come back to this post later on and see if I understand things better.

Thanks for all the help!

ADD COMMENT • link 7.9 years ago by AlchemistMoz ▴ 40

2

Entering edit mode

You asked Pysam to give you all the reads on chromosome 1 between base 100 to 120. This is just a tiny window of 20bp, right at the beginning of the chromosome. The ends (and middle, and certain other regions) of chromosomes are highly repetitive and thus reads don't map uniquely to these regions at all. Thus, your BAM file didn't have any reads in that tiny window, and thus pysam returned nothing. If you use a BAM file which does have reads right at the beginning of the chromosome, or you change the start/end values in the fetch command, you will see some reads.

ADD REPLY • link 7.9 years ago by John 13k