I understand that UCSC/hg19 positions are 0-based whereas GRCh38 positions are 1-indexed. However, when comparing feature positions in the hg19 ensGene.txt file with the same features in Homo_sapiens.GRCh38.76.gtf, the positions were completely off. For example, if you pick any protein_coding transcript from the GRCh38 GTF file and compare its start/end positions, exon start/end positions, CDS positions, etc. with its positions in the ensGene.txt file, the positions are often off by a few thousand. I have also checked the GTF file for GRCh37 (which should be identical to hg19), but the positions were again way off. Can anyone explain why this is?
hg19 == GRCh37
hg19 != GRCh38
Given that you knew that hg19 is GRCh37 and not GRCh38, I'm confused about why you're confused.
In the second part, are you comparing GRCh37 to hg19 or GRCh37 to GRCh38? If the former, I'm not sure why they would be off; if the latter, it is for the same reason as GRCh38 vs hg19. GRCh38 is a completely different assembly of the human reference genome. The sequence was revised, so chromosome lengths differ and features sit at different coordinates. You can only compare coordinates within an assembly, never between assemblies.
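For the within-assembly case (GRCh37 GTF vs hg19 ensGene.txt), the one systematic difference I would expect is the coordinate convention: UCSC genePred tables such as ensGene.txt store 0-based, half-open intervals, while GTF uses 1-based, inclusive intervals. Here is a minimal Python sketch of that conversion, assuming both files describe the same Ensembl release. The function names and the example coordinates are made up for illustration and are not taken from either file.

```python
def genepred_to_gtf(start0, end0):
    """Convert a 0-based, half-open interval (UCSC genePred tables)
    to a 1-based, inclusive interval (GTF).
    Only the start shifts; the end value stays the same."""
    return start0 + 1, end0


def gtf_to_genepred(start1, end1):
    """Inverse conversion: 1-based inclusive -> 0-based half-open."""
    return start1 - 1, end1


def parse_genepred_exons(exon_starts, exon_ends):
    """Parse the comma-separated exonStarts/exonEnds columns of a
    genePred row (they carry a trailing comma) and return the exons
    as GTF-style (start, end) pairs."""
    starts = [int(x) for x in exon_starts.rstrip(",").split(",")]
    ends = [int(x) for x in exon_ends.rstrip(",").split(",")]
    return [genepred_to_gtf(s, e) for s, e in zip(starts, ends)]


# Hypothetical coordinates, for illustration only.
exons = parse_genepred_exons("100,200,", "150,260,")
print(exons)
```

If the positions still differ by thousands of bases after this adjustment, the two files are most likely not on the same assembly (or not from the same annotation release), and an off-by-one convention cannot explain the gap.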