Dear Bioinformaticians,
I have two naive questions regarding T2T-CHM13 which I struggle to fully understand.
For the newly released complete human haploid genome T2T-CHM13, I think a total of 99 novel protein-coding genes were identified, and that each of these has a closest matching GENCODE ID in GRCh38, so they are paralogs of existing GRCh38 genes. At the same time, some genes and transcripts that were found in GRCh38 are no longer present in CHM13.
However, when I look at the RefSeq genome annotation for T2T-CHM13, I can still see some of the missing genes in the GTF file, for example FKBP4P2. May I ask what the reason for this is, or did I use the wrong GTF file?
Also, what is the difference between a missing gene and a missing transcript? I'm guessing a particular gene can encode multiple transcripts, but I find it hard to visualise how a specific transcript can be missing, while the gene is not. Thanks for your help!
Providers such as NCBI produce their own annotations, so differences like this can appear depending on the source of the annotation.
FKBP4P2 is marked as a pseudogene in the RefSeq annotation.
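If you want to check this yourself, here is a minimal Python sketch that reads GTF records and reports each gene's biotype together with the transcript IDs listed under it. It assumes the attribute column uses the keys `gene_id`, `gene_biotype` and `transcript_id`; other providers may use different keys (for example `gene_type`), so adjust as needed. The records in `TOY_GTF`, including the coordinates, `GENE_A` and the `TX_*` IDs, are made up for illustration and are not taken from the real annotation.

```python
import re
from collections import defaultdict

# Matches key "value" pairs in the 9th GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the GTF attribute column into a dict (last value wins on repeats)."""
    return dict(ATTR_RE.findall(field))


def summarize_genes(gtf_lines):
    """Return ({gene_id: biotype}, {gene_id: set of transcript_ids})."""
    biotypes = {}
    transcripts = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        feature = cols[2]
        attrs = parse_attributes(cols[8])
        gene = attrs.get("gene_id")
        if gene is None:
            continue
        if feature == "gene":
            # Assumed key name; falls back to "unknown" if absent.
            biotypes[gene] = attrs.get("gene_biotype", "unknown")
        elif feature == "transcript" and attrs.get("transcript_id"):
            transcripts[gene].add(attrs["transcript_id"])
    return biotypes, dict(transcripts)


# Made-up records for illustration only (placeholder contig and coordinates).
TOY_GTF = [
    '#!toy example',
    'chrN\tRefSeq\tgene\t1\t1000\t.\t+\t.\tgene_id "FKBP4P2"; transcript_id ""; gene_biotype "pseudogene";',
    'chrN\tRefSeq\tgene\t2000\t9000\t.\t+\t.\tgene_id "GENE_A"; transcript_id ""; gene_biotype "protein_coding";',
    'chrN\tRefSeq\ttranscript\t2000\t9000\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";',
    'chrN\tRefSeq\ttranscript\t2500\t9000\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A2";',
]

biotypes, transcripts = summarize_genes(TOY_GTF)
for gene, biotype in biotypes.items():
    print(gene, biotype, sorted(transcripts.get(gene, ())))
```

To run it on a real file, pass an open file handle instead of `TOY_GTF`, e.g. `summarize_genes(open(path))`. This also helps with the gene vs. transcript question: a gene record can be present in two annotations while the set of transcript IDs listed under it differs, so comparing the two dictionaries from each annotation shows which level an entry is missing at.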
In that case, should I just disregard these missing genes, even though they may show expression?