I have bulk rnaseq data and I am interested in 1 gene, it's transcripts and exons. I ran the rnaseq nf-core pipeline (v3.9) (which uses salmon) to generate gene counts and transcript counts. And then I ran featureCounts on the BAM files to generate exon counts.
I assume that total gene count == sum of all transcript counts == sum of all exon counts
These are the counts for this one gene. Values are mean of raw counts over n samples. Not normalised or scaled.
- Gene count: 11446
- Sum of all transcripts count: 11494
- Sum of all exons count:
M+O+f+
: 31740M+O-f+
: 486M-O+f+
: 30775M-O-f+
: 484M+O+f-
: 31740M-O+f-
: 30775M-O-f-
: 1253M+O-f-
: 1309
Total gene count does closely match sum of all transcript counts but not the exon counts. There are 8 variations of exon counts which is explained below.
This is the script used for featureCounts. I have single-end reads.
module load subread/2.0.0
featureCounts \
-a "genes.gtf" \
-o "counts-exon.txt" \
-F "GTF" \
-t "exon" \
-g "exon_id" \
-s 0 \
-T 4 \
-f \
-M \
-O \
*.bam
I included or excluded the parameters -M
, -O
and -f
-O Assign reads to all their overlapping meta-features
-M Multi-mapping reads will also be counted.
-f Perform read counting at feature level
which gives rise to the 8 variations of exon counts. But none of them matches the expected count of roughly 11500.
- I understand that gene count and sum of transcript counts are very close since they come from the same salmon pipeline. I expect the featurecounts based exon counts to be further away but not wildly different. Is there an explanation for this mismatch?
- Is there a parameter in featureCounts that is set incorrectly?
- Is there a way to generate exon counts using the rnaseq pipeline or salmon, so I don't need to use featureCounts?
Ideas, suggestions and solutions are appreciated.
Not exactly the same question, but a similar topic was discussed recently here.
I am the author of the other post, which has been mentioned by Mensur Dlakic
I just want to add that, today, I still haven't found an explanation for this problem with featureCounts. I have inspected my .gtf files and I could not really find any issues.
If you find any possible explanation for this problem, please keep me updated because at this point I really want to know what can be the reason.
Maybe an option could be to contact directly the authors of the tools :). Indeed it is an interesting issue. Maybe related of how multi-mapping reads are counted, how permissive is each tool with mismatches or reads in splicing sites, etc.
Is the nf-core BAM file in genomic or transcriptomic space? That could be an additional caveat as a genome alignment could result in slightly different assignment of reads than a transcriptomic-centered quantification. Does salmon use a genome decoy in the nf-core implementation?
When using just salmon, yes, it'll use a decoy genome and quantify via selective alignment:
https://github.com/nf-core/rnaseq/blob/master/modules/nf-core/modules/salmon/index/main.nf
If it's run after alignment via STAR, I'm less sure, but I think it still uses the decoys.