I have run STAR 2.5.0a on my bulk RNA-seq data, obtained with a single-end, stranded library preparation.
I set --quantMode GeneCounts to obtain the counts from the "embedded" htseq-count.
I obtained results like the following:
N_unmapped 146273 146273 146273
N_multimapping 3408293 3408293 3408293
N_noFeature 355858 17068060 392326
N_ambiguous 1189135 11003 513338
ENSG00000223972 0 0 0
ENSG00000227232 2 0 2
ENSG0000027826 0 0 0
ENSG00000243485 1 1 0
To my knowledge:
the second column contains the counts that would be obtained if the library prep were not strand-specific (--stranded=no);
the third column contains the counts that would be obtained if the library prep were strand-specific with the --stranded=yes setting;
the fourth column contains the counts that would be obtained if the library prep were strand-specific with the --stranded=reverse setting.
Overall, my results point to a library preparation consistent with the --stranded=reverse setting, which is perfectly fine.
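As a rough illustration of how that conclusion can be read off the summary rows, here is a minimal Python sketch. It uses only the four summary lines quoted above; the function names and the 0.5 ratio cutoff are my own assumptions, not anything defined by STAR or htseq-count, and comparing N_noFeature between the two stranded columns is only a heuristic.

```python
# Summary rows copied from the post (gene_id, unstranded, yes, reverse).
SUMMARY = """\
N_unmapped 146273 146273 146273
N_multimapping 3408293 3408293 3408293
N_noFeature 355858 17068060 392326
N_ambiguous 1189135 11003 513338
"""


def parse_counts(text):
    """Parse whitespace- or tab-separated four-column rows into a dict."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip anything that is not a four-column row
        rows[fields[0]] = tuple(int(value) for value in fields[1:])
    return rows


def guess_strandedness(rows, cutoff=0.5):
    """Heuristic: the stranded column with far fewer N_noFeature reads wins.

    The cutoff is an arbitrary assumption; if the two stranded columns
    are similar, the library is guessed to be unstranded.
    """
    _, yes_col, reverse_col = rows["N_noFeature"]
    ratio = min(yes_col, reverse_col) / max(yes_col, reverse_col)
    if ratio > cutoff:
        return "no"
    return "reverse" if reverse_col < yes_col else "yes"


rows = parse_counts(SUMMARY)
print(guess_strandedness(rows))
```

With the numbers above, the third column leaves about 17 million reads without a feature while the fourth leaves under 400 thousand, so the sketch lands on the reverse setting.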
Inspecting the columns, I would expect each value in the second column to equal the sum of the third and fourth columns, like this:
ENSG00000279457 17 0 17
ENSG00000248527 1260 1 1259
Here the second entry indicates 1259 hits for the sense RNA and 1 hit for a possible antisense RNA (asRNA).
However, I also have entries like the following:
ENSG00000228794 126 0 129
ENSG00000187634 128 621 185
ENSG00000131584 205 15 205
How can I interpret such results?
Why does the second column not equal the sum of the third and fourth columns here?
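To make the mismatch concrete, the gap between the second column and the sum of the other two can be tabulated for the rows quoted above. This is just a small Python sketch over those five lines; the function name is made up for illustration.

```python
# Gene rows copied from the post (gene_id, unstranded, yes, reverse).
ROWS = """\
ENSG00000279457 17 0 17
ENSG00000248527 1260 1 1259
ENSG00000228794 126 0 129
ENSG00000187634 128 621 185
ENSG00000131584 205 15 205
"""


def column_gap(text):
    """Return unstranded - (yes + reverse) for each gene row."""
    gaps = {}
    for line in text.splitlines():
        gene, unstranded, yes_col, reverse_col = line.split()
        gaps[gene] = int(unstranded) - (int(yes_col) + int(reverse_col))
    return gaps


gaps = column_gap(ROWS)
for gene, gap in gaps.items():
    # A gap of 0 matches the expectation; a negative gap means the
    # unstranded count is smaller than the two stranded counts combined.
    print(gene, gap)
```

The first two genes give a gap of 0, while the last three give negative gaps, i.e. the unstranded column counts fewer reads than the two stranded columns together.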
I have just completed the post; it was submitted incomplete by accident.
This would explain the cases in which you map on meta-features and/or have reads coming from sense and antisense RNAs that overlap each other. The GFF is unlikely to have overlapping features because it is associated with the human genome. Moreover, I am mapping on features instead of meta-features.
In any case, overlapping features would not explain entries like:
ENSG00000228794 126 0 129
The coordinates for ENSG00000228794: Chromosome 1: 825,138-859,446
The coordinates for ENSG00000225880: Chromosome 1: 826,206-827,522
They overlap and run in opposite directions, so you have likely got 3 reads that fall in the overlapping region. With an unstranded protocol there is no way to know which gene they come from, so they are counted as ambiguous rather than assigned to either gene. When the software knows that the reads must be reverse-stranded, it can assign them to ENSG00000228794.
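The reasoning can be sketched with a toy assignment function in Python. This is not STAR's or htseq-count's actual code, only a simplified union-style rule under several assumptions: whole gene spans stand in for exons, the strand given to each gene is assumed for illustration (the thread only establishes that they are opposite), and the read coordinates are invented.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    start: int
    end: int
    strand: str  # "+" or "-"


# Coordinates from the thread; strands are an assumption for illustration.
GENES = [
    Gene("ENSG00000228794", 825138, 859446, "+"),
    Gene("ENSG00000225880", 826206, 827522, "-"),
]

FLIP = {"+": "-", "-": "+"}


def assign(read_start, read_end, read_strand, genes, stranded):
    """Toy union-style assignment; stranded is "no", "yes" or "reverse"."""
    hits = []
    for gene in genes:
        if read_end < gene.start or read_start > gene.end:
            continue  # no positional overlap with this gene
        if stranded == "no":
            compatible = True
        elif stranded == "yes":
            compatible = read_strand == gene.strand
        else:  # "reverse": the read is the reverse complement of the RNA
            compatible = FLIP[read_strand] == gene.strand
        if compatible:
            hits.append(gene.gene_id)
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"
    return hits[0]


# Hypothetical read inside the overlapping region, aligned to the "-" strand.
for mode in ("no", "yes", "reverse"):
    print(mode, assign(826300, 826350, "-", GENES, mode))
```

In this sketch the same read is ambiguous when strand is ignored but gets a single gene in either stranded mode, which is why the unstranded column can be lower than the stranded ones.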