Hi everyone
Firstly, thank you in advance for any help you can give. I am new to bioinformatics, and Biostars has been immensely helpful. I have human RNA-seq data that I am currently processing. I have finished the trimming and alignment (with STAR) stages and have just used featureCounts to count the reads in my data.
I tried two different annotations with featureCounts. Both runs worked, but the count data differed. First I used the hg38 GTF from Ensembl, and second I used the built-in hg38 annotation from the Rsubread package (Entrez Gene IDs).
Both were successful, but when I compared corresponding genes between Ensembl and Entrez Gene, the counts were quite different. The total number of counted reads also differed: 36165730 with Ensembl versus 38752850 with Entrez Gene.
Why would the total number of counts be higher with Entrez Gene? It seems strange considering Ensembl is larger in scope.
I understand that Ensembl and Entrez do not completely agree, but the differences seemed quite dramatic. Is this normal? And is it okay to use the Entrez values, considering I aligned my data using an Ensembl GTF?
I think you might find the following manuscript helpful: "A comprehensive evaluation of Ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification".
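If you want to see where the two tables disagree, it can help to pair the genes through an ID mapping and look at the per-gene counts and the totals side by side. Below is a minimal Python sketch of that comparison. The gene IDs, counts, and the Ensembl-to-Entrez mapping are made-up placeholders, and the function names are just illustrative, so treat it as a starting point rather than a finished workflow.

```python
# Sketch only: the gene IDs, counts, and the Ensembl-to-Entrez mapping below
# are made-up illustration values, not real data.


def total_assigned(counts):
    """Sum of reads assigned to genes in one count table."""
    return sum(counts.values())


def compare_counts(ensembl_counts, entrez_counts, ensembl_to_entrez):
    """Pair up counts for genes that have IDs in both annotations.

    Returns a list of (ensembl_id, entrez_id, ensembl_count, entrez_count)
    tuples and a list of Ensembl IDs with no counterpart in the Entrez table.
    """
    paired, unmatched = [], []
    for ens_id, ens_count in ensembl_counts.items():
        entrez_id = ensembl_to_entrez.get(ens_id)
        if entrez_id is None or entrez_id not in entrez_counts:
            unmatched.append(ens_id)
            continue
        paired.append((ens_id, entrez_id, ens_count, entrez_counts[entrez_id]))
    return paired, unmatched


# Hypothetical toy inputs (placeholders, not real identifiers or counts).
ensembl_counts = {"ENSG_A": 100, "ENSG_B": 50, "ENSG_C": 7}
entrez_counts = {"1001": 120, "1002": 50, "1003": 30}
ensembl_to_entrez = {"ENSG_A": "1001", "ENSG_B": "1002"}

paired, unmatched = compare_counts(ensembl_counts, entrez_counts, ensembl_to_entrez)
ensembl_total = total_assigned(ensembl_counts)
entrez_total = total_assigned(entrez_counts)

for ens_id, entrez_id, ens_count, entrez_count in paired:
    print(ens_id, entrez_id, ens_count, entrez_count)
print("unmatched Ensembl IDs:", unmatched)
print("totals:", ensembl_total, entrez_total)
```

Genes that exist in only one annotation, or whose exon boundaries differ between the two, will show up here as unmatched IDs or as pairs with different counts. The summary statistics that featureCounts reports (assigned reads, reads with no feature, ambiguous reads) are also worth comparing between the two runs.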