Hi,
during mapping with STAR, I generated the count matrix for my differential gene expression analysis using the parameter --quantMode GeneCounts. However, I don't know which column of the output I should use for the analysis.
The help site says it depends on how my data is stranded, but I'm not sure how to determine that from my data.
In this Google group, Mr. Dobin suggests taking the 4th column if the read counts in the 4th column are generally higher than in the 3rd column, which is the case in my data.
However, he also states that the 4th column corresponds to the output of htseq-count with the parameter -s reverse, which is described in the HTSeq manual as if it were a setting only for paired-end reads (I only have single-end reads):
For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed.
So should I now use the 4th column of the STAR output or the 2nd (nonstranded)?
The real question is: what was the library prep procedure for your reads? The choice of column depends on the answer to this question.
Hint: for Illumina standard stranded kits, you should use the 4th column.
You're right, I was just wondering whether I have to track down the strand information from the library prep or whether I can see it in the read distribution. I will treat my data as reverse stranded for now and also try to find out whether my data is stranded at all :)
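For anyone who wants to check the read distribution directly, here is a minimal Python sketch of the comparison suggested above (column 3 versus column 4). It assumes the usual ReadsPerGene.out.tab layout: tab-separated, four N_* summary rows at the top, then gene ID, unstranded counts, counts for -s yes, and counts for -s reverse. The function names and the 0.8 cutoff are my own assumptions, not something STAR defines, and the library prep protocol remains the authoritative answer.

```python
import csv
import io


def column_totals(handle):
    """Sum count columns 2-4 over all genes, skipping the N_* summary rows."""
    totals = [0, 0, 0]  # unstranded, forward (-s yes), reverse (-s reverse)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("N_"):
            continue
        for i in range(3):
            totals[i] += int(row[i + 1])
    return totals


def guess_strandedness(totals, ratio=0.8):
    """Heuristic guess; `ratio` is an arbitrary cutoff chosen for illustration."""
    _, forward, reverse = totals
    stranded_sum = forward + reverse
    if stranded_sum == 0:
        return "undetermined"
    if reverse / stranded_sum >= ratio:
        return "reverse"
    if forward / stranded_sum >= ratio:
        return "forward"
    return "unstranded"


# Toy data for illustration only, not real STAR output.
example = (
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t50\t50\t50\n"
    "N_noFeature\t10\t900\t20\n"
    "N_ambiguous\t5\t1\t3\n"
    "geneA\t100\t2\t98\n"
    "geneB\t50\t1\t49\n"
)
totals = column_totals(io.StringIO(example))
print(totals, guess_strandedness(totals))
```

With a real file, replace the io.StringIO(example) handle with open("ReadsPerGene.out.tab"). If columns 3 and 4 come out roughly equal, the library is probably unstranded and column 2 is the one to use.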
I just wanted to update that I found the protocol for the RNA-Seq library preparation, and it was a stranded library :)
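Once the strandedness is known, pulling the matching column out of each ReadsPerGene.out.tab file could look roughly like the sketch below. The mapping and the function name are hypothetical helpers of mine, and the layout (four N_* summary rows, then gene ID plus three count columns) is assumed as described in the STAR manual.

```python
import io

# 0-based field index for each library type (assumed file layout, see above).
COLUMN_BY_STRANDEDNESS = {"unstranded": 1, "forward": 2, "reverse": 3}


def read_gene_counts(handle, strandedness):
    """Return {gene_id: count} from the column matching the library type."""
    col = COLUMN_BY_STRANDEDNESS[strandedness]
    counts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and the N_* summary rows at the top.
        if len(fields) < 4 or fields[0].startswith("N_"):
            continue
        counts[fields[0]] = int(fields[col])
    return counts


# Toy data for illustration only, not real STAR output.
sample = (
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t50\t50\t50\n"
    "N_noFeature\t10\t900\t20\n"
    "N_ambiguous\t5\t1\t3\n"
    "geneA\t100\t2\t98\n"
    "geneB\t50\t1\t49\n"
)
reverse_counts = read_gene_counts(io.StringIO(sample), "reverse")
print(reverse_counts)
```

Running this per sample and joining the resulting dictionaries by gene ID gives the count matrix for the differential expression step.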