Hello,
I am trying to use the pysam package in python to do some processing of a large BAM file, which is a BWA output. The header of my file looks like that:
$ samtools view -H myFile.bam
@HD VN:1.3 SO:coordinate
@SQ SN:1 LN:...
@SQ SN:2 LN:...
@SQ SN:3 LN:...
@SQ SN:4 LN:...
@SQ SN:5 LN:...
@SQ SN:6 LN:...
@SQ SN:7 LN:...
@SQ SN:8 LN:...
@SQ SN:9 LN:...
@SQ SN:10 LN:...
@SQ SN:10000001 LN:...
@PG ...
and I have the following alignment record within this file (just an example):
REC-XYZ-1 147 10000001 1 36 9S91M = 3601 3511 NAAAAACTCTAGTTGCTCGTGGGTGTCAACTTTTGATTATATGCTGTTATGTTTGTGCAAACTCAACTGAAATGTGAACTCAAGTGGTGATGCTATGAAC DBCC@:CDCCDDDDDDFFHGHIIIIEGHEHIIGIJIJIEHCGGFIDHHCHJJJJIIIIHHDIFGIFJIIIGIFJJJGHGEIIGIGGIHDFHHFFFDF@@@ NM:i:0 MD:Z:91 AS:i:91 XS:i:99
(notice the '10000001' in the 3rd field).
Using pysam, I do the following:
bam_file = pysam.AlignmentFile("myFile.bam", "rb")
it = bam_file.fetch("10000001")
for x in range(20):
aln = next(it)
print(aln.reference_id)
10
10
10
.
.
.
surprisingly, although the real reference id is '10000001', pysam sees it as '10'. For example, the same record I showed above looks like this:
REC-XYZ-1 147 10 0 36 9S91M 10 3600 91 NAAAAACTCTAGTTGCTCGTGGGTGTCAACTTTTGATTATATGCTGTTATGTTTGTGCAAACTCAACTGAAATGTGAACTCAAGTGGTGATGCTATGAAC array('B', [35, 33, 34, 34, 31, 25, 34, 35, 34, 34, 35, 35, 35, 35, 35, 35, 37, 37, 39, 38, 39, 40, 40, 40, 40, 36, 38, 39, 36, 39, 40, 40, 38, 40, 41, 40, 41, 40, 36, 39, 34, 38, 38, 37, 40, 35, 39, 39, 34, 39, 41, 41, 41, 41, 40, 40, 40, 40, 39, 39, 35, 40, 37, 38, 40, 37, 41, 40, 40, 40, 38, 40, 37, 41, 41, 41, 38, 39, 38, 36, 40, 40, 38, 40, 38, 38, 40, 39, 35, 37, 39, 39, 37, 37, 37, 35, 37, 31, 31, 31]) [('NM', 0), ('MD', '91'), ('AS', 91), ('XS', 99)]
(notice the '10' in the 3rd field)
The same thing happens for the RNEXT field (which I get through aln.next_reference_id), and this is really bad because I cannot differentiate between chromosome 10 and chromosome 10000001. Does anyone know why this is happening? Is this a bug in pysam? I would be quite amazed to find out that it can only take two-digit values as reference id (nothing about that in the documentation). Any ideas?
* I am running on linux, with python 2.7 and the latest version of pysam, using IPython notebook.
Devon already explained how to get the actual chromosome name from the refID, however I just thought i'd add where that number comes from.
The refID (number in each row of the BAM file that corresponds to a chromosome in the BAM header) is based on the order of the chromosomes in the header. So the first chromosome has a refID of 0, and the 11th has a refID of 10 -- which is what SN:10000001 is here for you :)
To make matters a little more complicated, the BAM file actually has two headers - an ASCII header that gets printed when you do samtools view -H, and a second redundant binary header that is used to do chrID -> rname conversion. Of course, these two headers should always agree - but they don't have to. I played an April's fools joke on reddit a few years back, where I uploaded a BAM file where the headers didn't match up and asked them to figure out why. Ho ho ho.