Problem Running Htseq
1
1
Entering edit mode
11.1 years ago
lkmklsmn ▴ 980

Hi I am trying to calculate exon counts from a .bam file using python and HTSeq.
After reading the bam file:

"bam_reader = HTSeq.BAM_Reader( <sorted bam file> )"

I get following error when trying to extract aligned reads

"for alnmt in bam_reader:
...  print alnmt"

"Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "../HTSeq/__init__.py", line 832, in __iter__
    yield SAM_Alignment.from_pysam_AlignedRead( pa, sf )
  File "_HTSeq.pyx", line 1251, in HTSeq._HTSeq.SAM_Alignment.from_pysam_AlignedRead (src/_HTSeq.c:21907)
  File "csamtools.pyx", line 800, in pysam.csamtools.Samfile.getrname (pysam/csamtools.c:8816)
ValueError: tid -1 out of range 0<=tid<25"

Does anyone have an idea what the problem is?

rnaseq htseq python bam alignment • 4.5k views
ADD COMMENT
1
Entering edit mode

A tid of -1 would mean a * in the RNAME column (the 3rd column). I'd need to go through the code, but does your BAM file perhaps have a read with a * in the RNAME column that lacks a 0x4 bit set in the FLAG? I imagine that could cause this error.

Edit: BTW, you can test whether this is the case with something like samtools view foo.bam | awk '{if($3 == "*") { if(!and($2, 0x4)) { print $0; }}}', which should print all reads that might cause this error.

ADD REPLY
1
Entering edit mode

Thanks. I wasnt able to make that awk statement work but I think you are right. Do you know any efficient way to correct this?

ADD REPLY
0
Entering edit mode

Hmm, apparently mawk is installed on ubuntu (perhaps that's what you're using, it's what I have in lab) and it lacks awk's bitwise operators (OS X must have gawk installed, since that awk command worked on my laptop last night!). I'll put together a quick script that should work regardless and post that as an answer.

ADD REPLY
0
Entering edit mode

You can use samtools's built in required filter -f:

samtools view -f 4 xyz.bam | awk '{if($3 == "*") { print $0; }}'

ADD REPLY
1
Entering edit mode
11.1 years ago

Since the bitwise operators in awk that I used in my comment apparently don't exist on all system (that's good to know!), here's a simply python solution. You could do this with perl to, but I loath the language.

#!/usr/bin/env python
import csv
import sys

f = csv.reader(sys.stdin, dialect="excel-tab")
of = csv.writer(sys.stdout, dialect="excel-tab")
for line in f :
    #Don't try to modify the header!!!
    if(line[1][0] == "@") :
        of.writerow(line)
        continue

    if(line[2] == "*") :
        if((int(line[1]) & 4) != 4) :
            line[1] = "%s" % (int(line[1]) | 4)
    of.writerow(line)

If you saved that as clean.py, then you can probably just samtools view namesorted.bam | python clean.py | htseq-count.py options - annotation.gtf if you want to directly use htseq-count. You can also just try to make a cleaned BAM file with samtools view -h something.bam | python clean.py | samtools view -bS - > something.clean.bam (if you really need that to be faster, you can implement the whole thing in C with the samtools API, but that's likely overkill!).

ADD COMMENT
0
Entering edit mode

Thank you! I just had this exact issue with pysam alone, without HTSeq but this greatly helps me too.

ADD REPLY
0
Entering edit mode

Good to know. Which aligner did you use that caused this formatting?

ADD REPLY
0
Entering edit mode

We use RNA-STAR 2.3

ADD REPLY
0
Entering edit mode

Interesting. I too use STAR, but haven't seen this behavior. Have you reported this to the author yet? If not, please do so, this problem is going to bite a number of people!

ADD REPLY

Login before adding your answer.

Traffic: 2017 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6