Question

Exon counts do not match gene and transcript counts

2

Entering edit mode

2.2 years ago

firestar ★ 1.6k

I have bulk rnaseq data and I am interested in 1 gene, it's transcripts and exons. I ran the rnaseq nf-core pipeline (v3.9) (which uses salmon) to generate gene counts and transcript counts. And then I ran featureCounts on the BAM files to generate exon counts.

I assume that total gene count == sum of all transcript counts == sum of all exon counts

These are the counts for this one gene. Values are mean of raw counts over n samples. Not normalised or scaled.

Gene count: 11446
Sum of all transcripts count: 11494
Sum of all exons count:
- M+O+f+: 31740
- M+O-f+: 486
- M-O+f+: 30775
- M-O-f+: 484
- M+O+f-: 31740
- M-O+f-: 30775
- M-O-f-: 1253
- M+O-f-: 1309

Total gene count does closely match sum of all transcript counts but not the exon counts. There are 8 variations of exon counts which is explained below.

This is the script used for featureCounts. I have single-end reads.

module load subread/2.0.0

featureCounts \
-a "genes.gtf" \
-o "counts-exon.txt" \
-F "GTF" \
-t "exon" \
-g "exon_id" \
-s 0 \
-T 4 \
-f \
-M \
-O \
*.bam

I included or excluded the parameters -M, -O and -f

-O  Assign reads to all their overlapping meta-features 
-M  Multi-mapping reads will also be counted.
-f  Perform read counting at feature level

which gives rise to the 8 variations of exon counts. But none of them matches the expected count of roughly 11500.

I understand that gene count and sum of transcript counts are very close since they come from the same salmon pipeline. I expect the featurecounts based exon counts to be further away but not wildly different. Is there an explanation for this mismatch?
Is there a parameter in featureCounts that is set incorrectly?
Is there a way to generate exon counts using the rnaseq pipeline or salmon, so I don't need to use featureCounts?

Ideas, suggestions and solutions are appreciated.

salmon rnaseq nf-core featurecounts exon • 1.4k views

ADD COMMENT • link updated 2.1 years ago by jared.andrews07 ★ 18k • written 2.2 years ago by firestar ★ 1.6k

1

Entering edit mode

Not exactly the same question, but a similar topic was discussed recently here.

ADD REPLY • link 2.2 years ago by Mensur Dlakic ★ 28k

0

Entering edit mode

I am the author of the other post, which has been mentioned by Mensur Dlakic

I just want to add that, today, I still haven't found an explanation for this problem with featureCounts. I have inspected my .gtf files and I could not really find any issues.

If you find any possible explanation for this problem, please keep me updated because at this point I really want to know what can be the reason.

ADD REPLY • link 2.2 years ago by Marco Pannone ▴ 810

0

Entering edit mode

Maybe an option could be to contact directly the authors of the tools :). Indeed it is an interesting issue. Maybe related of how multi-mapping reads are counted, how permissive is each tool with mismatches or reads in splicing sites, etc.

ADD REPLY • link 2.2 years ago by iraun 6.2k

0

Entering edit mode

Is the nf-core BAM file in genomic or transcriptomic space? That could be an additional caveat as a genome alignment could result in slightly different assignment of reads than a transcriptomic-centered quantification. Does salmon use a genome decoy in the nf-core implementation?

ADD REPLY • link 2.2 years ago by ATpoint 86k

1

Entering edit mode

When using just salmon, yes, it'll use a decoy genome and quantify via selective alignment:

https://github.com/nf-core/rnaseq/blob/master/modules/nf-core/modules/salmon/index/main.nf

If it's run after alignment via STAR, I'm less sure, but I think it still uses the decoys.

ADD REPLY • link 2.1 years ago by jared.andrews07 ★ 18k