Question

Pysam (python) problem parsing BAM reference id

1

Entering edit mode

8.6 years ago

liorglic ★ 1.5k

Hello,

I am trying to use the pysam package in python to do some processing of a large BAM file, which is a BWA output. The header of my file looks like that:

$ samtools view -H myFile.bam
@HD     VN:1.3  SO:coordinate
@SQ     SN:1    LN:...
@SQ     SN:2    LN:...
@SQ     SN:3    LN:...
@SQ     SN:4    LN:...
@SQ     SN:5    LN:...
@SQ     SN:6    LN:...
@SQ     SN:7    LN:...
@SQ     SN:8    LN:...
@SQ     SN:9    LN:...
@SQ     SN:10   LN:...
@SQ     SN:10000001     LN:...
@PG     ...

and I have the following alignment record within this file (just an example):

REC-XYZ-1      147     10000001        1       36      9S91M   =       3601    3511    NAAAAACTCTAGTTGCTCGTGGGTGTCAACTTTTGATTATATGCTGTTATGTTTGTGCAAACTCAACTGAAATGTGAACTCAAGTGGTGATGCTATGAAC    DBCC@:CDCCDDDDDDFFHGHIIIIEGHEHIIGIJIJIEHCGGFIDHHCHJJJJIIIIHHDIFGIFJIIIGIFJJJGHGEIIGIGGIHDFHHFFFDF@@@    NM:i:0  MD:Z:91 AS:i:91 XS:i:99

(notice the '10000001' in the 3rd field).

Using pysam, I do the following:

bam_file = pysam.AlignmentFile("myFile.bam", "rb")
it = bam_file.fetch("10000001")
for x in range(20):
    aln = next(it)
    print(aln.reference_id)

10
10
10
.
.
.

surprisingly, although the real reference id is '10000001', pysam sees it as '10'. For example, the same record I showed above looks like this:

REC-XYZ-1   147 10  0   36  9S91M   10  3600    91  NAAAAACTCTAGTTGCTCGTGGGTGTCAACTTTTGATTATATGCTGTTATGTTTGTGCAAACTCAACTGAAATGTGAACTCAAGTGGTGATGCTATGAAC    array('B', [35, 33, 34, 34, 31, 25, 34, 35, 34, 34, 35, 35, 35, 35, 35, 35, 37, 37, 39, 38, 39, 40, 40, 40, 40, 36, 38, 39, 36, 39, 40, 40, 38, 40, 41, 40, 41, 40, 36, 39, 34, 38, 38, 37, 40, 35, 39, 39, 34, 39, 41, 41, 41, 41, 40, 40, 40, 40, 39, 39, 35, 40, 37, 38, 40, 37, 41, 40, 40, 40, 38, 40, 37, 41, 41, 41, 38, 39, 38, 36, 40, 40, 38, 40, 38, 38, 40, 39, 35, 37, 39, 39, 37, 37, 37, 35, 37, 31, 31, 31])    [('NM', 0), ('MD', '91'), ('AS', 91), ('XS', 99)]

(notice the '10' in the 3rd field)

The same thing happens for the RNEXT field (which I get through aln.next_reference_id), and this is really bad because I cannot differentiate between chromosome 10 and chromosome 10000001. Does anyone know why this is happening? Is this a bug in pysam? I would be quite amazed to find out that it can only take two-digit values as reference id (nothing about that in the documentation). Any ideas?

* I am running on linux, with python 2.7 and the latest version of pysam, using IPython notebook.

bam sam python pysam • 6.6k views

ADD COMMENT • link updated 8.6 years ago by Matt Shirley 10k • written 8.6 years ago by liorglic ★ 1.5k

0

Entering edit mode

Devon already explained how to get the actual chromosome name from the refID, however I just thought i'd add where that number comes from.

The refID (number in each row of the BAM file that corresponds to a chromosome in the BAM header) is based on the order of the chromosomes in the header. So the first chromosome has a refID of 0, and the 11th has a refID of 10 -- which is what SN:10000001 is here for you :)

To make matters a little more complicated, the BAM file actually has two headers - an ASCII header that gets printed when you do samtools view -H, and a second redundant binary header that is used to do chrID -> rname conversion. Of course, these two headers should always agree - but they don't have to. I played an April's fools joke on reddit a few years back, where I uploaded a BAM file where the headers didn't match up and asked them to figure out why. Ho ho ho.

ADD REPLY • link 8.6 years ago by John 13k

0

Entering edit mode

8.6 years ago

Matt Shirley 10k

You're encountered one of the quirks of pysam, and hopefully now that you're pointed in the right direction all is good. I just waned to point out that this type of complexity is the exact reason I created my own parser simplesam:

import simplesam 
with open('myFile.bam', 'rb') as bamfile: 
  with simplesam.Reader(bamfile, regions=("10000001",)) as bam: 
    for read in bam:
      print(read.rname)

Each attribute matches the SAM specification, and there are some other class methods that are useful - I should document the whole thing better.

ADD COMMENT • link 8.6 years ago by Matt Shirley 10k

score 2 · Accepted Answer · 2016-04-11

2

Entering edit mode

8.6 years ago

Devon Ryan 104k

If you look at the read the docs page for pysam, you'll see the following warning regarding reference_id:

This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use pysam.AlignmentFile.getrname()

In other words, pysam isn't confusing anything, it just happens that many of the IDs happen to correspond to chromosome names in your case (though they'll be the wrong chromosome names).

ADD COMMENT • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks so much! Totally missed it.

print(bam_file.get_reference_name(aln.next_reference_id))

did the trick.

ADD REPLY • link 8.6 years ago by liorglic ★ 1.5k