I used featureCount for counting reads in a bam file.
featureCount notably outputs the length of each gene.
However, I noticed huge differences in number of gene exons and length between the 'hg38_RefSeq_exon.txt" annotation file provided with featureCount and the last version of GENCODE gtf. Why is that?
Maybe the following approach is not correct:
(1) keeping exon features only in GENCODE v28 GRCh38.p12 gtf
(2) sort the gtf subset by chromosome and start position:
chr1 11869 12227 ENSG00000223972
chr1 12010 12057 ENSG00000223972
chr1 12179 12227 ENSG00000223972
chr1 12613 12697 ENSG00000223972
chr1 12613 12721 ENSG00000223972
chr1 12975 13052 ENSG00000223972
...
(3) get the union of overlapping exons from the same genes:
> bedtools merge -i <sorted_gtf_subset> -d 0 -c 4 -o distinct
chr1 11869 12227 ENSG00000223972
chr1 12613 12721 ENSG00000223972
chr1 12975 13052 ENSG00000223972
chr1 13221 14501 ENSG00000223972,ENSG00000227232
chr1 15005 15038 ENSG00000227232
chr1 15796 15947 ENSG00000227232
chr1 16607 16765 ENSG00000227232
chr1 16858 17055 ENSG00000227232
...
(4) for each gene, sum the difference between start / end position of merged exons.
For example, featureCount outputs a length of 20,654 for ENSG00000237298 (RefSeq ID: 100506866) with 22 exons in its annotation file 'hg38_RefSeq_exon.txt'.
Using the method above with GENCODE gtf, I get 63 exons for this same gene and a length of 62,661 (difference of 42,007 bases !!!). I also noticed that 50 of these exons are also shared with other genes (50 exons with ENSG00000155657 and 1 with ENSG00000270956)
Could someone explain this difference to me?
Have you examined the GENCODE GTF file carefully? It is possible that there are overlapping annotations in that regions that is leading to the much higher count using your method.
As you point out the RefSeq entry is clear with 22 exons marked in it and the longest transcript included in ENSG00000237298 also points to that same entry.
I have just used the same approach with the latest RefSeq gff3. It gives the exact same 22 exons as in featureCount. I don't see what other filter I could apply to the GENCODE gtf (the source maybe?)