Hi guys,
I'm having troubles understanding something which I though I knew well (hard times, these). I always knew that, while Cufflinks returns FPKM, expected counts and other things, HTSeq returns raw counts. This should mean that in the output of HTSeq I should have the counts of the fragments that were mapping on my particular gene. Right?
I am graphically and command-linely assessing the fact that I have indeed around 30 fragments mapping on a gene in the wild type condition and 0 in the treatment, but htseq returns only 1 in the wild type and 0 in the treatment. This gene is knocked-down, so I know it's down-regulated also from wet lab experiments.
EDIT: Three questions:
How does HTSeq handle reads that exceed borders of the considered regions but overlap them? My guess is that it doesn't count them.
I always thought that I understood correctly this: the count should just be a count of the fragments (in case of PE reads), without any normalization, right?
What the hell is going on?
The command is:
python /usr/bin/htseq-count -f bam -r pos -t exon -i "ID" alignment.bam gene-set.gff3 > stdout 2> stderr
Thank you in advance for the help!
Are these observed fragments uniquely mapping ones? HtSeq-count only uses uniquely mapped reads (with a certain mapping quality). AFAIK, you can check both with IGV.
Checking it RN. Thanks!
@michael.ante, something strange happens.
I filter the bam files keeping only primary alignments; samtools flagstat reveals, for one example file:
However, the output of htseq-count for that same read group, in the "__" lines is:
Note: 158109+5+24730 = 182844 (99.09% of my reads!!!)
Now I have even more doubts. My htseq command is:
The gtf file has CDS and exon lines and proper gene_id field (already checked). I don't know what to think.
I can add further information:
using stranded
using unstranded
There is a sensible difference between with and without the stranded option. However, my sequencing library is stranded and I was expecting to keep that information alongside the position. I'm disappointed. Any of you knows why this might happen? Am I missing something basic?
The "__alignment_not_unique" are the counts of your multiple alignments (same holds true for samtools flagstat) - don't mix it up with the reads.
Try to use -s reverse e.g. TruSeq stranded generates reads of this flavour.
With the -s reverse option I got no significant difference (just a very little one, but not in the direction of an improvement). This, to me, means that I should have ran the analysis with reverse from the beginning, and that also running it without, in this case, doesn't alter much the results.
The strandedness was indeed the problem, and removing it worked just fine. Therefore, thanks for all the help!