Hi,
I have 2x50 bp human strand-specific data. I aligned them against hg19 with STAR and after that I extracted the read count per gene with htseq-count. I used refseq genes from UCSC.
But the results are quite strange... Here is a snapshot of my data. The reads shown in blue are aligning to the sense strande of the reference. SERF2 is on the sense strand and MIR1282 is on the anti-sense strand. htseq-count output:
SERF2 0
MIR1282 783
and I used :
htseq-count -s yes -i gene_id - $gtf_refseq > htseq_refseq.txt
I don't understand where my error is.
Has anyone tried htseq-count with STAR?
Here are the gtf lines of interest :
chr15 hg19_refGene exon 44085857 44085957 0.000000 - . gene_id "NR_031695"; transcript_id "NR_031695";
chr15 hg19_refGene start_codon 44084575 44084577 0.000000 + . gene_id "NM_001018108"; transcript_id "NM_001018108";
chr15 hg19_refGene CDS 44084575 44084581 0.000000 + 0 gene_id "NM_001018108"; transcript_id "NM_001018108";
chr15 hg19_refGene exon 44084040 44084581 0.000000 + . gene_id "NM_001018108"; transcript_id "NM_001018108";
chr15 hg19_refGene CDS 44085173 44085281 0.000000 + 2 gene_id "NM_001018108"; transcript_id "NM_001018108";
chr15 hg19_refGene exon 44085173 44085281 0.000000 + . gene_id "NM_001018108"; transcript_id "NM_001018108";
chr15 hg19_refGene CDS 44085908 44085968 0.000000 + 1 gene_id "NM_001018108"; transcript_id "NM_001018108";
chr15 hg19_refGene stop_codon 44085969 44085971 0.000000 + . gene_id "NM_001018108"; transcript_id "NM_001018108";
chr15 hg19_refGene exon 44085908 44088287 0.000000 + . gene_id "NM_001018108"; transcript_id "NM_001018108";
Another strange thing. When it's an isolated gene (not overlapped by an other gene) like this one in the picture below, there are only 8 reads reported by htseq-count while there are a lot more (~40 in average per position) shown in IGV.
thanks
N.
My gtf is coming from UCSC, so the first problem (overlapping genes) is coming from there. But the second one (the gene alone), I checked the reads and the are all good (alignment score : 255, NH:1 ).. I check the htseq outut and it gives me a lot of alignmentnotunique..
Again, I think what is happening is that your GTF does not distinguish between transcripts and genes. So, htseq-count treats all transcripts as if they were different genes. Thus it throws away not just reads that are non-unique because of overlapping genes but also if they are non-unique because of overlapping transcript isoforms for the same gene. I don't see any way around that other than using a different GTF file.
I redid it with ensembl gtf (from ensembl, not UCSC) and I have the same behaviour..
But when I put -s no (so htseq-count will not take the strand into account). The gene "alone" (cf figure 2), it gives me good read count (~ 600). With -s yes : read count = ~9
That's weird. Sounds like this really is a problem with how htseq-count is interpreting stranded info from STAR.
So, you are still not getting any counts for SERF2? Could you put a snippet of your ensembl gtf and the command you ran and results you get for those genes above. Just to complete the story of what you have tried so far. I am just trying to figure out htseq-count myself. Maybe we should try to invite the author to comment on this thread?