HTSeq-count wrong reads count
0
2
Entering edit mode
10.2 years ago
ariesliu324 ▴ 20

Hello all,

This has puzzled me for days and I couldn't find any explanation on the internet.

I noticed HTSeq-count gives the reads counts of a gene as 0. But when viewed in IGV with the accepted_hits.bam from tophat2 alignment, there are hundreds to thousands reads aligned in that gene region (from different samples).

This is how I used HTSeq-count from HTSeq/0.6.1p1 and samtools/1.1

samtools sort -T tmp -o accepted_hits_nsort.bam -n accepted_hits.bam
htseq-count --format=bam --strand=no --order=name -a 5 --mode=union accepted_hits_nsort.bam Homo_sapiens.GRCh37.72.gtf > htseq_count_0.6.txt

However, when I break down the accepted_hits.bam and extract the chromosome where the gene is, HTSeq-count gives the right reads counts for this BAM file.

My accepted_hits.bam contains 60,845,216 alignment using samtools view -c accepted_hits.bam. I don't think for PE RNA-seq, file at size of 4.5G would be a problem for samtools or HTSeq-count to deal with?

My suspicions are:

  1. samtools name sort problem
  2. htseq-count bug
  3. out of memory

But from the try-error I have done so far, I cannot really figure out what has gone wrong here.

Could any one give any hint here? Any suggestion would be appreciated. Thanks.

RNA-Seq • 5.8k views
ADD COMMENT
2
Entering edit mode

Isn't it --stranded instead of --strand? Default value for --stranded is yes. If so, then probably it is assuming your reads are stranded. Although, not sure if that could be the issue.

ADD REPLY
1
Entering edit mode

Hello ariesliu324!

It appears that your post has been cross-posted to another site: SEQanswers.

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
1
Entering edit mode

If it is a memory issue, I guess it should fail for all the genes.

ADD REPLY
0
Entering edit mode

Yes,

I also had an issue, I was using different versions of reference fasta and GTF files.

Make sure that reference fasta for mapping and GTF/GFF for feature count should be from same source and similar version.

As mentioned by komal.rathi, it's --stranded or -s, but not --strand

HTH

ADD REPLY
0
Entering edit mode

I am having the same problem, @ariesliu324, did you manage to resolve this issue?

ADD REPLY
0
Entering edit mode

I have seen this before with polycistronic genes (more than one protein from one transcript). htseq-count does not assign a read when it overlaps two "genes". To htseq-count gene is simply the id attribute field (gene_id by default). Since polycistronic genes are generally given a different gene ids for each transcript/protein. Thus htseq-count does not know where to assign the read. To check for this you can do a bedtools intersect on you gtf/gff with the region of interest. If there is more than one gene_id in that region this is the likely culprit. Not sure if this is your problem, but it has bitten me a couple times. Hope that helps!

Aside: There is not really a great solution to assigning these reads. For example if the reads were assigned to each "gene" from a polycistronic locus and then you did a DE analysis and identified one of the polycistronic genes you would get all of them. If you then did a GO enrichment type analysis you would certainly enrich for any specific terms assigned to this gene. By not counting these reads (the default here) at least you don't get false positives enrichment, but you could miss important biology by not identifying a DE gene. It's really an annotation issue. When analyzing RNA data how should one handle annotation on a protein level?

ADD REPLY

Login before adding your answer.

Traffic: 3271 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6