Pysam changes bam format when printing
2
2
Entering edit mode
7.5 years ago

Dear all,

I have indexed bam file, and I want to print all alignments on chromosome "2" using pysam.

My code is

import pysam
bam = pysam.AlignmentFile("Aligned.sortedByCoord_rep1.out.bam", "rb")
for line in bam.fetch("2"):
     print line

What I get is:

D00733:162:CADM2ANXX:2:1311:7466:63232  163 18  3328    255 1S47M2S 18  3419    47  TTGCTTAGTGTCCGAAATACCATCCTCAAGGCTAAGAACTAAATCGATTA  array('B', [33, 33, 33, 33, 33, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38])    [('NH', 1), ('HI', 1), ('AS', 70), ('nM', 10)]
D00733:162:CADM2ANXX:2:1215:11810:89291 163 18  3356    255 44M6S   18  3441    44  GGCTAAGAACTAAATCGATTATTCTGGCTCGTAACGCATATAATATGGCA  array('B', [33, 33, 33, 32, 33, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 35, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 33])    [('NH', 1), ('HI', 1), ('AS', 70), ('nM', 10)]

So there are 2 questions: 1. Why pysam does not preserve format of bam file? 2. Why does it print out chromosome 18 instead of 2?

I would appreciate if somebody points out on what am I doing wrong. Thanks

pysam • 3.9k views
ADD COMMENT
4
Entering edit mode
7.5 years ago
IP ▴ 770

Hi, you are not wrong, you are doing it correctly :)

The AlignedSegment object is independent of a SAM-file. The AlignedSegment contains only an index of the chromosomal identifier (tid), and, thus, it does not contain the chromosome number information. It means that the number you see as 18 in the field 3 of the line is the number corresponding to chromosome 2 in the sequence dictionary. You can check that by doing the following:

import pysam
bam = pysam.AlignmentFile("Aligned.sortedByCoord_rep1.out.bam", "rb")
for line in bam.fetch("2"):
    print line
    print line.reference_id
    print bam.get_reference_name(line.reference_id)

As you might have notice, I have added two lines to your previous code:

1.line.reference_id contains the index of the reference sequence in the sequence dictionary

2.bam.get_reference_name(line.reference_id) will return the reference name corresponding to numerical tid. In your case chromosome 2.

ADD COMMENT
0
Entering edit mode

Thanks for your reply, now it makes sense. My original task is a bit different from what I have posted (In my post I wanted to check whether I do everthing correct).

What I need to do is read everyline of bam file, and save only multimap alignments. So what I was palnning to do is following:

for line in bam_file:
    if not "NH:i:1" in line:
        print line

So it seems that this method of pysam will not work for me, will it? Are any other ways of reading bam file line by line? Thank you

ADD REPLY
0
Entering edit mode

As i.sudbery metion in his response, the has_tag and get_tag objects will resolve your issue

ADD REPLY
4
Entering edit mode
7.5 years ago

In order to print an AlignedSegment out as a formatted SAM entry you need to use the tostring method and supply the AlignmentFile that the AlignedSegment came from. This contains the mapping of the numeric contig ids to the string contig names and also the encoding for the quality scores.

Thus your code should be:

import pysam
bam = pysam.AlignmentFile("Aligned.sortedByCoord_rep1.out.bam", "rb")
for line in bam.fetch("2"):
     print line.tostring(bam)

following on from your comment, to filter on an NH tag you would do the following:

import pysam
bam = pysam.AlignmentFile("Aligned.sortedByCoord_rep1.out.bam", "rb")
for line in bam.fetch("2"):
    if not (line.has_tag("NH") and line.get_tag("NH") == 1):     
        print line.tostring(bam)

you only need the has_tag if only some of your lines have the NH tag.

ADD COMMENT
0
Entering edit mode

Worked perfectly, thank you!

ADD REPLY
1
Entering edit mode

To answer your actaul question... Its like this because this is what is actually stored in a BAM file, which are not quite just compressed SAM files. pysam acts as a wrapper around the htslib c library, which is the same thing samtools uses, and that returns a c-structure representing the read rather than a text representation as you see in a SAM file. When you do samtools view, it is doing something similar to the code above.

ADD REPLY

Login before adding your answer.

Traffic: 1973 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6