Question

STAR quantMode geneCounts: weird outcomes

0

6.1 years ago

davide.chiarugi ▴ 20

I ran STAR 2.5.0a on my bulk RNA-seq data, which were generated with a single-end, stranded library preparation.

I set --quantMode GeneCounts to obtain the counts from the "embedded" htseq-count.

I have obtained results like the following:

N_unmapped           146273    146273     146273
N_multimapping      3408293   3408293    3408293
N_noFeature          355858  17068060     392326
N_ambiguous         1189135     11003     513338
ENSG00000223972           0         0          0
ENSG00000227232           2         0          2
ENSG0000027826            0         0          0
ENSG00000243485           1         1          0

To my knowledge:

the second column contains the counts that would be obtained if the library prep were not strand-specific (--stranded=no);
the third column contains the counts that would be obtained for a strand-specific library prep with the "stranded = yes" setting;
the fourth column contains the counts that would be obtained for a strand-specific library prep with the "stranded = reverse" setting.

Overall, the results I obtained point to a library preparation consistent with the "stranded = reverse" setting, which is perfectly fine.
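As a rough illustration of that reasoning, the sketch below sums the per-gene counts in each column and reads the N_noFeature row, using only the rows quoted above. The function name `summarize` is made up for this example, and a real ReadsPerGene.out.tab is tab-separated and much longer, so treat this as a sanity check rather than a definitive strandedness test.

```python
import io

# Rows copied from the excerpt above (whitespace-separated here;
# the real file is tab-separated and contains every gene).
EXAMPLE = """\
N_unmapped 146273 146273 146273
N_multimapping 3408293 3408293 3408293
N_noFeature 355858 17068060 392326
N_ambiguous 1189135 11003 513338
ENSG00000223972 0 0 0
ENSG00000227232 2 0 2
ENSG0000027826 0 0 0
ENSG00000243485 1 1 0
"""

def summarize(handle):
    """Return (assigned, no_feature) as three-element lists, one value per count column."""
    assigned = [0, 0, 0]
    no_feature = None
    for line in handle:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        counts = [int(x) for x in fields[1:]]
        if fields[0] == "N_noFeature":
            no_feature = counts
        elif not fields[0].startswith("N_"):
            # Gene rows: accumulate the counts column by column.
            assigned = [a + c for a, c in zip(assigned, counts)]
    return assigned, no_feature

assigned, no_feature = summarize(io.StringIO(EXAMPLE))
labels = ["unstranded", "stranded=yes", "stranded=reverse"]
for label, a, n in zip(labels, assigned, no_feature):
    print(f"{label:18s} assigned={a} noFeature={n}")
```

The column with a very large N_noFeature value (here the third one) is the one whose strand assumption does not match the library.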

Looking at individual genes, I would expect the value in the second column to be the sum of the third and fourth columns, like this:

ENSG00000279457     17      0      17
ENSG00000248527   1260      1    1259

Here the second entry would indicate 1259 hits for the sense RNA and 1 hit for a possible asRNA.

However, I also have entries like the following:

ENSG00000228794    126      0     129
ENSG00000187634    128    621     185
ENSG00000131584    205     15     205
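To make the discrepancy explicit, here is a small sketch that computes column 2 minus (column 3 + column 4) for the five genes quoted in this post and lists the ones where the difference is not zero. The helper name `deviation` is just for this example.

```python
# (unstranded, stranded=yes, stranded=reverse) counts copied from the post.
rows = {
    "ENSG00000279457": (17, 0, 17),
    "ENSG00000248527": (1260, 1, 1259),
    "ENSG00000228794": (126, 0, 129),
    "ENSG00000187634": (128, 621, 185),
    "ENSG00000131584": (205, 15, 205),
}

def deviation(unstranded, fwd, rev):
    """Difference between the unstranded count and the sum of the two stranded counts."""
    return unstranded - (fwd + rev)

# Keep only the genes where the naive "col2 = col3 + col4" expectation fails.
deviating = {gene: deviation(*counts)
             for gene, counts in rows.items()
             if deviation(*counts) != 0}

for gene, diff in deviating.items():
    print(gene, diff)
```

In all three deviating rows the unstranded count is smaller than the sum of the stranded counts.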

How can I interpret such results?

RNA-Seq STAR HTSeq-count • 3.3k views

ADD COMMENT • link 6.1 years ago by davide.chiarugi ▴ 20

0

Why is the outcome weird?

ADD REPLY • link 6.1 years ago by h.mon 35k

0

I have just completed the post; it was submitted incomplete by accident.

ADD REPLY • link 6.1 years ago by davide.chiarugi ▴ 20

0

That would explain cases in which you map to meta-features and/or have reads from sense and antisense RNAs that overlap each other. The GFF is unlikely to contain overlapping features because it is the one associated with the human genome. Moreover, I am mapping to features rather than meta-features.

In any case, overlapping features would not explain entries like:

ENSG00000228794 126 0 129

ADD REPLY • link 6.1 years ago by davide.chiarugi ▴ 20

2

The coordinates for ENSG00000228794: Chromosome 1: 825,138-859,446

The coordinates for ENSG00000225880: Chromosome 1: 826,206-827,522

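They overlap and run in opposite directions, so you've likely got 3 reads that fall in the overlapping region. With an unstranded protocol, there's no way to know which gene they come from, so they are counted as ambiguous rather than assigned to either gene. When the software knows that reads must run reverse, it knows they go to 228794, which is why the fourth column shows 129 and the second only 126.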
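A toy model of this, not STAR's or htseq-count's actual code: the gene coordinates are the ones given above, but the strand labels ("+" for 228794, "-" for 225880), the read positions, and the 126/3 split of reads are assumptions picked to reproduce the row in question. Overlap is tested against whole gene intervals rather than exons, which is a simplification.

```python
from collections import Counter

# Coordinates from this reply; the strand labels are an assumption
# (the reply only says the two genes run in opposite directions).
GENES = {
    "ENSG00000228794": (825138, 859446, "+"),
    "ENSG00000225880": (826206, 827522, "-"),
}

def assign(read_start, read_end, read_strand, mode):
    """Return a gene ID, 'ambiguous', or 'no_feature' for one single-end read."""
    if mode == "reverse":
        wanted = "-" if read_strand == "+" else "+"  # read is antisense to its gene
    elif mode == "yes":
        wanted = read_strand                          # read is sense to its gene
    else:
        wanted = None                                 # unstranded: strand is ignored
    hits = [
        gene for gene, (start, end, strand) in GENES.items()
        if read_start <= end and read_end >= start
        and (wanted is None or strand == wanted)
    ]
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"
    return hits[0]

# Hypothetical reads: 126 outside the overlapping region, 3 inside it,
# all on the minus strand as expected for a reverse-stranded library.
reads = [(830000, 830050, "-")] * 126 + [(826500, 826550, "-")] * 3

def count(mode):
    return Counter(assign(s, e, strand, mode) for s, e, strand in reads)

for mode in ("no", "yes", "reverse"):
    print(mode, dict(count(mode)))
```

Under these assumptions ENSG00000228794 gets 126, 0 and 129 reads in the three modes, the same pattern as in the question.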

ADD REPLY • link 6.1 years ago by swbarnes2 14k

score 2 · Answer 1 · 2018-10-25

2

6.1 years ago

swbarnes2 14k

The columns don't add up because there are overlapping features in your GTF, so the aligner can't always unambiguously assign a read to one of those features.

ADD COMMENT • link 6.1 years ago by swbarnes2 14k