Exon counts do not match gene and transcript counts
0
2
Entering edit mode
2.1 years ago
firestar ★ 1.6k

I have bulk rnaseq data and I am interested in 1 gene, it's transcripts and exons. I ran the rnaseq nf-core pipeline (v3.9) (which uses salmon) to generate gene counts and transcript counts. And then I ran featureCounts on the BAM files to generate exon counts.

I assume that total gene count == sum of all transcript counts == sum of all exon counts

These are the counts for this one gene. Values are mean of raw counts over n samples. Not normalised or scaled.

  • Gene count: 11446
  • Sum of all transcripts count: 11494
  • Sum of all exons count:
    • M+O+f+: 31740
    • M+O-f+: 486
    • M-O+f+: 30775
    • M-O-f+: 484
    • M+O+f-: 31740
    • M-O+f-: 30775
    • M-O-f-: 1253
    • M+O-f-: 1309

Total gene count does closely match sum of all transcript counts but not the exon counts. There are 8 variations of exon counts which is explained below.

This is the script used for featureCounts. I have single-end reads.

module load subread/2.0.0

featureCounts \
-a "genes.gtf" \
-o "counts-exon.txt" \
-F "GTF" \
-t "exon" \
-g "exon_id" \
-s 0 \
-T 4 \
-f \
-M \
-O \
*.bam

I included or excluded the parameters -M, -O and -f

-O  Assign reads to all their overlapping meta-features 
-M  Multi-mapping reads will also be counted.
-f  Perform read counting at feature level 

which gives rise to the 8 variations of exon counts. But none of them matches the expected count of roughly 11500.

  • I understand that gene count and sum of transcript counts are very close since they come from the same salmon pipeline. I expect the featurecounts based exon counts to be further away but not wildly different. Is there an explanation for this mismatch?
  • Is there a parameter in featureCounts that is set incorrectly?
  • Is there a way to generate exon counts using the rnaseq pipeline or salmon, so I don't need to use featureCounts?

Ideas, suggestions and solutions are appreciated.

salmon rnaseq nf-core featurecounts exon • 1.4k views
ADD COMMENT
1
Entering edit mode

Not exactly the same question, but a similar topic was discussed recently here.

ADD REPLY
0
Entering edit mode

I am the author of the other post, which has been mentioned by Mensur Dlakic

I just want to add that, today, I still haven't found an explanation for this problem with featureCounts. I have inspected my .gtf files and I could not really find any issues.

If you find any possible explanation for this problem, please keep me updated because at this point I really want to know what can be the reason.

ADD REPLY
0
Entering edit mode

Maybe an option could be to contact directly the authors of the tools :). Indeed it is an interesting issue. Maybe related of how multi-mapping reads are counted, how permissive is each tool with mismatches or reads in splicing sites, etc.

ADD REPLY
0
Entering edit mode

Is the nf-core BAM file in genomic or transcriptomic space? That could be an additional caveat as a genome alignment could result in slightly different assignment of reads than a transcriptomic-centered quantification. Does salmon use a genome decoy in the nf-core implementation?

ADD REPLY
1
Entering edit mode

When using just salmon, yes, it'll use a decoy genome and quantify via selective alignment:

https://github.com/nf-core/rnaseq/blob/master/modules/nf-core/modules/salmon/index/main.nf

If it's run after alignment via STAR, I'm less sure, but I think it still uses the decoys.

ADD REPLY

Login before adding your answer.

Traffic: 1434 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6