Question

counting reads with featureCounts - uniquely assign reads to transcripts

0

Entering edit mode

9.8 years ago

Pfs ▴ 580

Hello,

I have a question regarding featureCounts behavior.

Assume my GTF file has 1 gene, comprising 2 transcripts (G1T1, G1T2), with a total of 3 exons (E1,E2,E3):

G1T1 consists of exons E1 and E2, while G1T2 consists of exons E3 and E2.

Assume my SAM file has 4 reads, as shown below.

R1 overlaps E1;
R2 is spliced between E1 and E3;
R3 is spliced between E1 and E2;
R4 is spliced between E3 and E2.


    |____E1____|                   |____E2____|   Transcript G1T1
                   |____E3____|    |____E2____|   transcript G1T2

    R1------>
    R2---------....--->
    R3---------....................---->
                       R4-----.....---->

When using featureCounts with option -g transcript_id, only R1 is assigned and the counts for G1T1 is 1.
When using featureCoutns with option -g transcript_id -O, all reads are assigned and the counts for G1T1 is 4 (R1,R2,R3,R4), whereas the count for G1T2 is 3 (R2,R3,R4).

I understand the logic behind allowing multiply overlapping features (-O), however I would expect that at the transcript level R3 is uniquely assigned to G1T1 and R4 uniquely assigned to G1T2. In other words, I would expect that even though the parts of a spliced read are mapped to multiple features, when the reads are counted at a level where a distinction between metafeatures is possible, this distinction is used to uniquely assign the reads.

I feel that without "-O", the counts are unnecessarily low and with the option -O the counts are unnecessarily inflated. What am I missing here?

Thanks in advance!

RNA-Seq counting reads • 5.9k views

ADD COMMENT • link updated 4.0 years ago by Ram 45k • written 9.8 years ago by Pfs ▴ 580

Ram · Answer 1 · 2015-07-03

1

Entering edit mode

9.8 years ago

Devon Ryan 105k

One should really not use featureCounts (even with -O) to get transcript-level metrics. If you really want these numbers, use something like eXpress or another tool that has an expectation-maximization step.

BTW, what you're missing is that exon E2 is shared between both transcripts, so any alignment to it will necessarily by non-unique (aka ambiguous).

ADD COMMENT • link updated 4.0 years ago by Ram 45k • written 9.8 years ago by Devon Ryan 105k

Ram · Answer 2 · 2015-07-03

Indeed, the counts can be quite low from featureCounts, but if you use the default gene-wise counting with stranded seq it is a bit better. The transcript-wise counting will be very strict without -O, especially for genes with many splice types, so you might have better luck with counting based on exons first followed by DEXSeq differential analysis. In featureCounts, you can use the -R option to see the read assignment result for each read to help you fine-tune further.