Hi there,
Quick question: are gene annotations for RNA-Seq specific to the sequencing platform?
I'm looking at a dataset produced on an Illumina HiSeq 2500. I've aligned it against both human GRCh38 and the older GRCh37 build. So far so good... the alignment files are the same size.
When I then count the reads, I get almost 60,000 with the GRCh37 GTF but only 17,000 with the GRCh38 GTF. Both GTF files come from ensembl.org.
I'm not sure what's wrong. Is there any explanation for this?
Many thanks!
File size is useless unless it is an extreme value. It indicates nothing, so you cannot base a "so far so good" statement on it.
What do those numbers refer to? Genes in your GTF file? Major genome builds are released a few years apart for a reason: each one can include substantial refinements to the annotation.
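One way to check what the numbers refer to is to count the genes in each GTF directly. Below is a minimal Python sketch, assuming a standard 9-column, tab-separated GTF whose attribute column carries a `gene_id` and, where available, a `gene_biotype` (as recent Ensembl files do). The function name `count_genes` and the file name in the usage note are hypothetical.

```python
import re
from collections import Counter

# Attribute patterns as they usually appear in Ensembl GTFs (assumption).
GENE_ID = re.compile(r'gene_id "([^"]+)"')
BIOTYPE = re.compile(r'gene_biotype "([^"]+)"')


def count_genes(gtf_lines):
    """Return (number of unique gene_ids, Counter of gene biotypes)."""
    biotype_by_gene = {}
    for line in gtf_lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:               # skip malformed lines
            continue
        gene = GENE_ID.search(fields[8])
        if gene is None:
            continue
        biotype = BIOTYPE.search(fields[8])
        # Record each gene once, with "unknown" if no biotype is given.
        biotype_by_gene.setdefault(
            gene.group(1), biotype.group(1) if biotype else "unknown"
        )
    return len(biotype_by_gene), Counter(biotype_by_gene.values())
```

Run it on both files, e.g. `with open("annotation.gtf") as fh: n_genes, biotypes = count_genes(fh)`. If the two annotations contain a similar number of genes, the difference in your numbers comes from the counting step rather than from the annotation itself.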
The numbers refer to the final counts that are then used for differential expression analysis. I'm not sure whether GRCh38 would really contain only 17,000 genes. Would it contain significantly fewer genes than GRCh37?