I know that for RNASeq the default option is to count reads based on exons (-t 'exon'
in case of counting with featureCounts), and based on my understanding from previous questions like here and here and -poorly described- subread documentation I have realised that by counting the reads based on genes (-t 'gene'
) we would end up including transcripts aligned to intron regions to our final count table for each gene as well. If this is correct I would assume that by changing 'exon' to 'gene' one should end up with more counts for most genes (both cases we summarise at gene level or -g 'gene_id'
) and overall more read counts, however it seems that by setting feature type to 'gene' we actually end up with less reads assigned:
analysing a test BAM file with:
featureCounts -p --countReadPairs -M -T 8 \
-t 'gene' \
-g 'gene_name' \
-a gencode.v38.annotation.gtf \
-o test.txt \
test.bam
would result in:
//================================= Running ==================================\\
|| ||
|| Load annotation file gencode.v38.annotation.gtf ... ||
|| Features : 60649 ||
|| Meta-features : 59385 ||
|| Chromosomes/contigs : 25 ||
|| ||
|| Process BAM file test.bam... ||
|| Paired-end reads are included. ||
|| Total alignments : 30471464 ||
|| Successfully assigned alignments : 24302041 (79.8%) ||
|| Running time : 0.29 minutes ||
while analysing the same BAM file with same arguments except for default exon feature type:
featureCounts -p --countReadPairs -M -T 8 \
-t 'exon' \
-g 'gene_name' \
-a gencode.v38.annotation.gtf \
-o test.txt \
test.bam
results in 6% much more assigned reads:
//================================= Running ==================================\\
|| ||
|| Load annotation file gencode.v38.annotation.gtf ... ||
|| Features : 1499012 ||
|| Meta-features : 59385 ||
|| Chromosomes/contigs : 25 ||
|| ||
|| Process BAM file test.bam... ||
|| Paired-end reads are included. ||
|| Total alignments : 30471464 ||
|| Successfully assigned alignments : 26053717 (85.5%) ||
|| Running time : 0.25 minutes ||
Could someone please explain to me what I'm missing here?
Yeah , it is expected right. when you summarise against gene exon-exon spanning reads are counted once but in the exon level you count as twice. Since the split reads originated from two exons. Check the junction reads please.