Hello. I am trying to count at the exon level using Subread's (v1.6.3) featureCounts for my RNA-seq data (reversely stranded library prep). I noticed that for reversely stranded genes, the exon counts are backwards (i.e. for a 22 exon gene, exon 1 is actually exon 22, exon 2 is exon 21, and so on). Is there a way to fix this?
Here is my code [from part of a loop so $i is my file and $gtf is the path to the gtf file with exon_id:
featureCounts -p \
-F GTF \
-s 2 \
-a $gtf \
-O \
-t exon -g exon_id \
-o counts/$i'_counts.txt' \
../aligned/$i'_mapped.sam'
Attached is a photo of IGV for an example gene and the output counts for this gene. You can see that I expect 0 counts for exons 2-4 and many counts for exon 22 but the data shows it backwards.
Counts for the transcript in IGV screenshot:
Any guidance is appreciated. I want to ultimately do differential expression at the exon level so summarizing by gene is not an option.
Wouldn't you be comparing exon expression between samples but still in the same gene? That should take care of possible problems with the exon numbering convention. Exons are functional at mRNA level, which is why the exon with the most 3' genomic position gets named the first exon, as it's the most 5' in the pre-mRNA. As long as feature IDs are unique and consistent across samples, differential expression should not be affected by the name any feature is given.
@Ram Yes that's true. I thought it would be more clear to present 'exon 1' as the first exon in the mRNA (closest to the start codon)
Exon 1 is still closest to the start codon no matter what the orientation is. The whole purpose of "reverse numbering" exons is to name the exon closest to start codon as exon-1. Am I misunderstanding something?
In the table that I uploaded, ENSMUST00000170872.2.22 is actually exon 1 and ENSMUST00000170872.2.1 is actually Exon 22. This is my issue.
Ah I see. This is going to be a little tricky. You should start by getting a list of genes on the minus strand, then use R to create lookup tables that contain number of exons per transcript. With that, you can do something like
corrected_exon_number = abs(current_exon_number - total_exons) + 1
to generate new IDs for the exons.That's a good idea. Thank you!