This really bothers me.
I used tophats and htseq-count to get count information for a RNAseq projects (50 bp reads). I then used OSA from Omicsoft to do the same thing and get gene count numbers.
Well, I was ready to accept some difference, actually, the total mapped reads are very much similar from the two pipelines (tophat2 qc report)
My htseq-count was used as:
htseq-count -i gene_name bamfile annotation > outfile
However, at the very end, I noticed for multiple genes, the count data between arraystudio and tophat2 are dramatically different, for example, for many genes tophat2 outputed 1 or 0 for particular gene, arraystudio aligner give counts ranges from several to hundreds of reads.
For all that I checked, for some highly expressed gene, tophat2 pipeline reported 100 times more reads than arraystudio, seriously I actually believe arraystudio reads are more accurate, since there are certain genes we know should show decent expression are essentially reported as not expressed by tophat2.
Well, I have two questions:
- Anyone every seen this dramatic different different aligne/quantifications?
- As tophat2 is widely used, does this mean I must have done something with the pipeline? -- given the high similar reported overall aligned pairs, I would say the two aligners should have done similarly
- Then can htseq-count be the cause of such difference? I used
gene_name
, notgene_ID
for quantification of counts - How do I qc this? can I extract a particular gene from my tophat2 output .bam files and look at them?
I understand this is complicated but really appreciate your time and any input is welcome, this is really shocking for the difference I observed and I am almost sure something must have gone wrong after the tophat2.
Thanks
Can you get a BAM or SAM file from OSA? If so, run that through htseq-count. If the number are similar to what you get with tophat2 then you know OSA is screwing up the counting. If they're not, then you know that the alignment is being done poorly by tophat2. I should note that htseq-count is working correctly, if something disagrees with it substantially then that thing is either wrong or working under completely different assumptions (e.g., using an EM method).
that's a very good suggestion... just did that... however I think the bam file from arraystudio has a different format... and htseq-count complained:
well, it fixed by first samtools index thebamfile
I actually extracted the region of interest from bam files from two different aligners, I do not see a huge difference...
the problem may lay after bam generation...
looks like htseq-count may have issue:
convert bam file to raw read count file for DESeq or EdgeR
someone may have seen similar issues
That's a two year old thread and pretty much any bug reported then will have gotten fixed by now.