Hello all,
This has puzzled me for days and I couldn't find any explanation on the internet.
I noticed HTSeq-count gives the reads counts of a gene as 0. But when viewed in IGV with the accepted_hits.bam
from tophat2 alignment, there are hundreds to thousands reads aligned in that gene region (from different samples).
This is how I used HTSeq-count from HTSeq/0.6.1p1
and samtools/1.1
samtools sort -T tmp -o accepted_hits_nsort.bam -n accepted_hits.bam
htseq-count --format=bam --strand=no --order=name -a 5 --mode=union accepted_hits_nsort.bam Homo_sapiens.GRCh37.72.gtf > htseq_count_0.6.txt
However, when I break down the accepted_hits.bam and extract the chromosome where the gene is, HTSeq-count gives the right reads counts for this BAM file.
My accepted_hits.bam
contains 60,845,216 alignment using samtools view -c accepted_hits.bam
. I don't think for PE RNA-seq, file at size of 4.5G would be a problem for samtools or HTSeq-count to deal with?
My suspicions are:
- samtools name sort problem
- htseq-count bug
- out of memory
But from the try-error I have done so far, I cannot really figure out what has gone wrong here.
Could any one give any hint here? Any suggestion would be appreciated. Thanks.
Isn't it
--stranded
instead of--strand
? Default value for--stranded
is yes. If so, then probably it is assuming your reads are stranded. Although, not sure if that could be the issue.Hello ariesliu324!
It appears that your post has been cross-posted to another site: SEQanswers.
This is typically not recommended as it runs the risk of annoying people in both communities.
If it is a memory issue, I guess it should fail for all the genes.
Yes,
I also had an issue, I was using different versions of reference fasta and GTF files.
Make sure that reference fasta for mapping and GTF/GFF for feature count should be from same source and similar version.
As mentioned by komal.rathi, it's
--stranded
or-s
, but not--strand
HTH
I am having the same problem, @ariesliu324, did you manage to resolve this issue?
I have seen this before with polycistronic genes (more than one protein from one transcript). htseq-count does not assign a read when it overlaps two "genes". To htseq-count gene is simply the id attribute field (gene_id by default). Since polycistronic genes are generally given a different gene ids for each transcript/protein. Thus htseq-count does not know where to assign the read. To check for this you can do a bedtools intersect on you gtf/gff with the region of interest. If there is more than one gene_id in that region this is the likely culprit. Not sure if this is your problem, but it has bitten me a couple times. Hope that helps!
Aside: There is not really a great solution to assigning these reads. For example if the reads were assigned to each "gene" from a polycistronic locus and then you did a DE analysis and identified one of the polycistronic genes you would get all of them. If you then did a GO enrichment type analysis you would certainly enrich for any specific terms assigned to this gene. By not counting these reads (the default here) at least you don't get false positives enrichment, but you could miss important biology by not identifying a DE gene. It's really an annotation issue. When analyzing RNA data how should one handle annotation on a protein level?