I received a transcriptome-aligned bam file from a coworker, which looks a little different than what I'm used to. A sample read from his bam looks like this:
HISEQ:240:C8VGLANXX:1:1101:1007:4288 0 gi|311893327|ref|NM_001198861.1| 365 255 26M * 0 0 NTTGGACAGGCGAATTCATGGTCGTG #<<ABGGDGGGGGGGGGGGGGGGDGB XA:i:1 MD:Z:0T25 NM:i:1
It lacks a chromosome identifier, which is a bit strange. Still, I thought I would be able to use a refseq GTF to count the reads with HTSeq, since a refseq identifier is present. My GTF file looks like this:
chr1 ucsc_refGene exon 4782568 4782733 . - . gene_id "Mrpl15"; transcript_id "NM_001177658"; exon_number "3"; exon_id "NM_001177658.3"; gene_name "Mrpl15";
My colleague was unsure whether his strands were reversed, so I ran HTSeq twice, once with --stranded=yes and once with --stranded=reverse. Either way, the read count tables report 0 reads for every gene:
Stranded=yes:
__no_feature 10954731
__ambiguous 0
__too_low_aQual 0
__not_aligned 5947831
__alignment_not_unique 0
Stranded=reverse
__no_feature 10954731
__ambiguous 0
__too_low_aQual 0
__not_aligned 5947831
__alignment_not_unique 0
I invoked HTSeq like this:
python -m HTSeq.scripts.count -f bam -r name --type=CDS --idattr=gene_id --mode=union --stranded=reverse blahblah.bam ucsc_refseq.gtf > blahblah-CDS-reverse.txt
I suppose the problem is that this isn't really the right gtf to use with this bam, but I'm not sure where I can get right kind, or what it should look like. Help?
Hi, Not sure if this is the right answer. Does your bam file has a header? It happened to me before that the alignment was done with a reference where the chromosomes were identified as '1, 2, 3 ...' and in the gtf file they are identified as 'chr1, chr2, chr3, ...'. I believe that if this is the case you can just modify the chromosome names in the gtf file. You can do: sed 's/^chr/^/g' gtf_file -i
I've had that problem too, and I don't think this is the same thing. Normally, my bam files look like this:
HISEQ:240:C8VGLANXX:1:1101:1000:59382 321 chr9 49079038 0 50M chr15 81972436 0 NCTTGGTCCTGTTTCCTCCTGTGGCCCGGAGTTTGATGTGCAGGGCAGTG #3<:AGGDGGGGGGGGGGGFFDFEGGGGGGGGGGGGGEDGGGGFGDGGGG AS:i:-6 XN:i:0 XM:i:2
In the example above, my coworker's bam does not appear to have a chromosome id at all. Instead of chromosome and position, it just gives a refseq ID (gi|311893327|ref|NM_001198861.1). I'm not sure how to work around this.
Since your BAM is aligned to refseq mRNA (at least that one example) no standard GTF file is going to work for counting (unless you make one up yourself) or you could convert the bam back into fastq's and then align to the genome again. Then you can use a standard GTF file.