Question

HTSeq with transcriptome alignment- finding the right gtf

0

Entering edit mode

8.5 years ago

k.s.moore113 • 0

I received a transcriptome-aligned bam file from a coworker, which looks a little different than what I'm used to. A sample read from his bam looks like this:

HISEQ:240:C8VGLANXX:1:1101:1007:4288    0   gi|311893327|ref|NM_001198861.1|    365 255 26M *   0   0   NTTGGACAGGCGAATTCATGGTCGTG  #<<ABGGDGGGGGGGGGGGGGGGDGB  XA:i:1  MD:Z:0T25   NM:i:1

It lacks a chromosome identifier, which is a bit strange. Still, I thought I would be able to use a refseq GTF to count the reads with HTSeq, since a refseq identifier is present. My GTF file looks like this:

chr1    ucsc_refGene    exon    4782568 4782733 .   -   .   gene_id "Mrpl15"; transcript_id "NM_001177658"; exon_number "3"; exon_id "NM_001177658.3"; gene_name "Mrpl15";

My colleague was unsure whether his strands were reversed, so I ran HTSeq twice, once with --stranded=yes and once with --stranded=reverse. Either way, the read count tables report 0 reads for every gene:

Stranded=yes:
__no_feature    10954731
__ambiguous 0
__too_low_aQual 0
__not_aligned   5947831
__alignment_not_unique  0

Stranded=reverse
__no_feature    10954731
__ambiguous 0
__too_low_aQual 0
__not_aligned   5947831
__alignment_not_unique  0

I invoked HTSeq like this:

python -m HTSeq.scripts.count -f bam -r name --type=CDS --idattr=gene_id --mode=union --stranded=reverse blahblah.bam ucsc_refseq.gtf > blahblah-CDS-reverse.txt

I suppose the problem is that this isn't really the right gtf to use with this bam, but I'm not sure where I can get right kind, or what it should look like. Help?

RNA-Seq sequencing next-gen HTSeq • 3.9k views

ADD COMMENT • link updated 8.5 years ago by GenoMax 147k • written 8.5 years ago by k.s.moore113 • 0

0

Entering edit mode

Hi, Not sure if this is the right answer. Does your bam file has a header? It happened to me before that the alignment was done with a reference where the chromosomes were identified as '1, 2, 3 ...' and in the gtf file they are identified as 'chr1, chr2, chr3, ...'. I believe that if this is the case you can just modify the chromosome names in the gtf file. You can do: sed 's/^chr/^/g' gtf_file -i

ADD REPLY • link 8.5 years ago by mjg ▴ 30

0

Entering edit mode

I've had that problem too, and I don't think this is the same thing. Normally, my bam files look like this:

HISEQ:240:C8VGLANXX:1:1101:1000:59382 321 chr9 49079038 0 50M chr15 81972436 0 NCTTGGTCCTGTTTCCTCCTGTGGCCCGGAGTTTGATGTGCAGGGCAGTG #3<:AGGDGGGGGGGGGGGFFDFEGGGGGGGGGGGGGEDGGGGFGDGGGG AS:i:-6 XN:i:0 XM:i:2

In the example above, my coworker's bam does not appear to have a chromosome id at all. Instead of chromosome and position, it just gives a refseq ID (gi|311893327|ref|NM_001198861.1). I'm not sure how to work around this.

ADD REPLY • link 8.5 years ago by k.s.moore113 • 0

0

Entering edit mode

Since your BAM is aligned to refseq mRNA (at least that one example) no standard GTF file is going to work for counting (unless you make one up yourself) or you could convert the bam back into fastq's and then align to the genome again. Then you can use a standard GTF file.

ADD REPLY • link 8.5 years ago by GenoMax 147k

0

Entering edit mode

8.5 years ago

mjg ▴ 30

Sorry I added the answer as a comment..

Hi, Not sure if this is the right answer. Does your bam file has a header? It happened to me before that the alignment was done with a reference where the chromosomes were identified as '1, 2, 3 ...' and in the gtf file they are identified as 'chr1, chr2, chr3, ...'. I believe that if this is the case you can just modify the chromosome names in the gtf file. You can do: sed 's/^chr/^/g' gtf_file -i

ADD COMMENT • link 8.5 years ago by mjg ▴ 30

0

Entering edit mode

Well, I replied without noticing it was a comment instead of an answer. See above.

ADD REPLY • link 8.5 years ago by k.s.moore113 • 0

0

Entering edit mode

This solution is not applicable in this case.

ADD REPLY • link 8.5 years ago by GenoMax 147k

score 3 · Accepted Answer · 2016-05-11

3

Entering edit mode

8.5 years ago

michael.ante ★ 3.9k

Since it's already aligned to the transcriptome, you can use for instance RSEM.

While aligning to the transcriptome you will encounter a lot of multi-mapping reads. Each time two or more transcripts share one exon, a read landing to that exon will be counted multiple times. Therefore, you don't want to use HTSeq-count.

ADD COMMENT • link 8.5 years ago by michael.ante ★ 3.9k

0

Entering edit mode

I've never used RSEM before. It looks quite difficult. I'm a lowly PhD student in a lab with no real bioinformatics support, and nearly everything I know about bioinformatics in general and RNAseq in particular I've taught myself via online tutorials. I know it's a bit of a faux pax to ask this, but is there anything out there that's a little more user friendly?

Edited to add: The only thing my colleague really wants to know is the read distribution over transcript features, i.e. how many reads fall within the 5'UTR vs the coding sequence etc. I thought this would be easy to do with HTSeq count, but... nope.

ADD REPLY • link 8.5 years ago by k.s.moore113 • 0

0

Entering edit mode

To be honest I would just convert the bam back to fastq files (or ask for the original fastq files) and do a proper mapping to the genome, that might save you some trouble (and you will find nice explanations on how to do that).

ADD REPLY • link 8.5 years ago by WouterDeCoster 47k

0

Entering edit mode

In that case, I think I'll be better off asking for the original fastq files. This is a ribofootprinting experiment, which means that the reads are much shorter than is normal for RNAseq. That's why they were mapped to the transcriptome in the first place; I'm told that many aligners have difficulty mapping short reads over splice junctions.

That said, I've since heard that STAR is capable of managing that problem better than most alignment tools, so perhaps that will be a good solution.

ADD REPLY • link 8.5 years ago by k.s.moore113 • 0

0

Entering edit mode

Right, I'm unsure about the correct workflow for ribofootprinting, was assuming you had just standard RNA-seq. Not sure what's your best guess then.

ADD REPLY • link 8.5 years ago by WouterDeCoster 47k

0

Entering edit mode

The only thing my colleague really wants to know is the read distribution over transcript features, i.e. how many reads fall within the 5'UTR vs the coding sequence etc

You don't need HTseqcounts for that. An "easy way" would be to compute the read coverage (with samtools depth or bedtools genomecov for instance) for all genes, then you can plot that coverage for separated genes, or with some extra-work, as a meta-gene and see where your reads pile up.

ADD REPLY • link 8.5 years ago by Carlo Yague 8.9k