9.0 years ago
hirak.sarkar
I was looking into the mapping from Ensembl gene IDs to HGNC gene symbols, and a snapshot of it looks like this:
ENSG00000256024.1 CT476828.6
ENSG00000256252.1 CT476828.7
ENSG00000255638.1 CT476828.4
ENSG00000256521.1 CT476828.11
ENSG00000256490.1 CT476828.10
ENSG00000238720.1 CT476828.2
ENSG00000148828.5 CT476828.1
From http://www.gencodegenes.org/gencodeformat.html I understand that Ensembl gene IDs have a version number appended to them, but what are the dot-appended values on the HGNC names? Are they unique? If I remove the dots, many Ensembl IDs would map to the same HGNC gene symbol. Can anyone explain the naming protocol?
Thanks
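To make the concern concrete, here is a small sketch that groups the seven pairs from the snapshot above by name with the dot suffix removed. The function name `group_by_stripped_name` is made up for illustration.

```python
from collections import defaultdict

# Pairs copied from the snapshot in the question.
pairs = [
    ("ENSG00000256024.1", "CT476828.6"),
    ("ENSG00000256252.1", "CT476828.7"),
    ("ENSG00000255638.1", "CT476828.4"),
    ("ENSG00000256521.1", "CT476828.11"),
    ("ENSG00000256490.1", "CT476828.10"),
    ("ENSG00000238720.1", "CT476828.2"),
    ("ENSG00000148828.5", "CT476828.1"),
]

def group_by_stripped_name(pairs):
    """Group Ensembl IDs by gene name with the trailing '.N' removed."""
    groups = defaultdict(list)
    for gene_id, name in pairs:
        base = name.rsplit(".", 1)[0]  # a name without a dot is kept as-is
        groups[base].append(gene_id)
    return dict(groups)

groups = group_by_stripped_name(pairs)
for base, ids in groups.items():
    print(base, len(ids))
```

With the suffix kept, the seven names are distinct; with it removed, all seven IDs fall under the single prefix CT476828.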
First, don't use the version number for Ensembl IDs. Second, CT476828.1 is not an HGNC gene name. HGNC gene names look like "polo-like kinase 1", with the associated gene symbol "PLK1". CT476828.1 looks more like a contig ID to me. Ensembl gene names and gene symbols are taken from HGNC, so once you have the gene ID you can get the gene name directly, either with BioMart or with the API.
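As a rough sketch of the API route (not the code from the answer above), the following assumes the public Ensembl REST server at rest.ensembl.org and its `lookup/id` endpoint. The helper names `lookup_url` and `fetch_gene_record` are made up for illustration, and the `display_name` and `biotype` fields in the commented usage are an assumption about the JSON that comes back.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # public Ensembl REST server

def lookup_url(gene_id):
    """Build the lookup URL for a stable ID, dropping any '.N' version suffix."""
    stable_id = gene_id.split(".", 1)[0]
    return f"{REST_SERVER}/lookup/id/{stable_id}"

def fetch_gene_record(gene_id, timeout=30):
    """Fetch the JSON record for one gene (needs network access)."""
    request = urllib.request.Request(
        lookup_url(gene_id), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

# Example usage (not executed here because it needs network access):
# record = fetch_gene_record("ENSG00000148828.5")
# print(record.get("display_name"), record.get("biotype"))
```

For more than a handful of IDs, BioMart or a bulk download is likely the better fit than one request per gene.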
Thanks for clearing up the confusion. I still don't understand what the appended ".1", ".2", etc. signify. I thought they were gene names because I prepared the mapped list from a GTF file. Here is a line from the GTF file which refers to this symbol as the gene name.
Also edited the question mentioning gene symbols.
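For reference, a mapping like the one in the question can be pulled out of the attribute column of a GTF roughly as follows. The sample line is hypothetical: it is built from the first pair in the question with placeholder values in the other columns, and is not the actual line from the file. The function names are also made up for illustration.

```python
import re

# Matches quoted attributes such as: gene_id "ENSG00000256024.1";
# Unquoted attributes (for example a bare number) are skipped by this pattern.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def parse_attributes(attr_field):
    """Turn the 9th GTF column into a dict (repeated keys keep the last value)."""
    return dict(ATTR_RE.findall(attr_field))

def id_and_name(gtf_line):
    """Return (gene_id, gene_name) from one tab-separated GTF line."""
    fields = gtf_line.rstrip("\n").split("\t")
    attrs = parse_attributes(fields[8])
    return attrs.get("gene_id"), attrs.get("gene_name")

# Hypothetical line: columns 1-8 are placeholders, not real annotation values.
example_line = "\t".join([
    "chrN", "SOURCE", "gene", "1", "2", ".", "+", ".",
    'gene_id "ENSG00000256024.1"; gene_name "CT476828.6";',
])
print(id_and_name(example_line))
```

This shows why the value ends up labelled as a gene name: the GTF stores whatever name the annotation assigned in `gene_name`, whether or not it is an HGNC symbol.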
The .1 after ENSG00000256024 is the version number. I can't really think of a use for it, because if the gene changes significantly it gets a new ID. Also, many tools don't recognize it.
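Since many tools expect the bare ID, a minimal sketch for separating the stable ID from its version could look like this. The function name `split_version` is made up for illustration.

```python
def split_version(ensembl_id):
    """Return (stable_id, version); version is None without a numeric suffix."""
    stable_id, _, version = ensembl_id.partition(".")
    return stable_id, (int(version) if version.isdigit() else None)

print(split_version("ENSG00000256024.1"))
print(split_version("ENSG00000148828"))
```

Applying the same split to both sides of a join avoids mismatches between versioned and unversioned ID lists.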
It looks like the CTxxxxx names correspond to novel non-coding transcripts, and the genes were named after those transcripts. Gene symbols are generally associated with well-characterized genes, so novel genes usually get some sort of ID as a name instead.
I think my confusion comes from the distinction between novel genes and well known genes.
Thanks for the help!