counting reads with featureCounts - uniquely assign reads to transcripts
2
0
Entering edit mode
9.4 years ago
Pfs ▴ 580

Hello,

I have a question regarding featureCounts behavior.

Assume my GTF file has 1 gene, comprising 2 transcripts (G1T1, G1T2), with a total of 3 exons (E1,E2,E3):

G1T1 consists of exons E1 and E2, while G1T2 consists of exons E3 and E2.

Assume my SAM file has 4 reads, as shown below.

  • R1 overlaps E1;
  • R2 is spliced between E1 and E3;
  • R3 is spliced between E1 and E2;
  • R4 is spliced between E3 and E2.

    |____E1____|                   |____E2____|   Transcript G1T1
                   |____E3____|    |____E2____|   transcript G1T2

    R1------>
    R2---------....--->
    R3---------....................---->
                       R4-----.....---->
  • When using featureCounts with option -g transcript_id, only R1 is assigned and the counts for G1T1 is 1.
  • When using featureCoutns with option -g transcript_id -O, all reads are assigned and the counts for G1T1 is 4 (R1,R2,R3,R4), whereas the count for G1T2 is 3 (R2,R3,R4).

I understand the logic behind allowing multiply overlapping features (-O), however I would expect that at the transcript level R3 is uniquely assigned to G1T1 and R4 uniquely assigned to G1T2. In other words, I would expect that even though the parts of a spliced read are mapped to multiple features, when the reads are counted at a level where a distinction between metafeatures is possible, this distinction is used to uniquely assign the reads.

I feel that without "-O", the counts are unnecessarily low and with the option -O the counts are unnecessarily inflated. What am I missing here?

Thanks in advance!

RNA-Seq counting reads • 5.6k views
ADD COMMENT
1
Entering edit mode
9.4 years ago

One should really not use featureCounts (even with -O) to get transcript-level metrics. If you really want these numbers, use something like eXpress or another tool that has an expectation-maximization step.

BTW, what you're missing is that exon E2 is shared between both transcripts, so any alignment to it will necessarily by non-unique (aka ambiguous).

ADD COMMENT
0
Entering edit mode
9.4 years ago
mark.ziemann ★ 1.9k

Indeed, the counts can be quite low from featureCounts, but if you use the default gene-wise counting with stranded seq it is a bit better. The transcript-wise counting will be very strict without -O, especially for genes with many splice types, so you might have better luck with counting based on exons first followed by DEXSeq differential analysis. In featureCounts, you can use the -R option to see the read assignment result for each read to help you fine-tune further.

ADD COMMENT

Login before adding your answer.

Traffic: 1725 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6