Question

HGNC gene symbols

1

9.0 years ago

hirak.sarkar ▴ 20

I was looking at a mapping from Ensembl gene IDs to HGNC gene symbols. A snapshot of it looks like this:

ENSG00000256024.1  CT476828.6
ENSG00000256252.1  CT476828.7
ENSG00000255638.1  CT476828.4
ENSG00000256521.1  CT476828.11
ENSG00000256490.1  CT476828.10
ENSG00000238720.1  CT476828.2
ENSG00000148828.5  CT476828.1

From http://www.gencodegenes.org/gencodeformat.html I understand that the Ensembl gene IDs have the version number appended to them, but what are the dot-appended values on the HGNC names? Are they unique? If I remove the dot suffixes, many Ensembl IDs would be mapped to the same HGNC gene symbol. Can anyone explain the naming protocol?
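To make the uniqueness question concrete, here is a minimal Python sketch that groups the seven rows of the snapshot above by name, once with the dot suffix kept and once with it removed. The helper name `group_by_name` is made up for illustration, and the result only describes these seven rows, not the full annotation.

```python
from collections import defaultdict

# Rows copied from the snapshot: (versioned Ensembl gene ID, name from the mapping).
rows = [
    ("ENSG00000256024.1", "CT476828.6"),
    ("ENSG00000256252.1", "CT476828.7"),
    ("ENSG00000255638.1", "CT476828.4"),
    ("ENSG00000256521.1", "CT476828.11"),
    ("ENSG00000256490.1", "CT476828.10"),
    ("ENSG00000238720.1", "CT476828.2"),
    ("ENSG00000148828.5", "CT476828.1"),
]


def group_by_name(pairs, strip_suffix=False):
    """Group Ensembl IDs by name; optionally drop the trailing '.N' from the name."""
    groups = defaultdict(list)
    for gene_id, name in pairs:
        # rsplit leaves names without a dot (e.g. "PLK1") unchanged.
        key = name.rsplit(".", 1)[0] if strip_suffix else name
        groups[key].append(gene_id)
    return dict(groups)


with_suffix = group_by_name(rows)
without_suffix = group_by_name(rows, strip_suffix=True)

print(len(with_suffix))     # 7: every name is distinct while the suffix is kept
print(len(without_suffix))  # 1: all seven IDs collapse onto "CT476828"
```

In this snapshot the suffix is the only thing that distinguishes the names, so removing it would merge seven different genes under one label.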

Thanks

gene ensembl hgnc • 6.3k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.0 years ago by hirak.sarkar ▴ 20

4

First, don't use the version number for Ensembl IDs. Second, CT476828.1 is not an HGNC gene name. HGNC gene names look like "polo-like kinase 1", with the associated gene symbol "PLK1". CT476828.1 looks more like a contig ID to me. Ensembl takes its gene names and gene symbols from HGNC, so once you have the gene ID you can get the gene name directly, either with BioMart or with the API:

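# Assumes $gene_adaptor came from Bio::EnsEMBL::Registry->get_adaptor('Human', 'Core', 'Gene')
# and $EnsemblID is a stable ID without the version suffix.
my $Ensgene = $gene_adaptor->fetch_by_stable_id($EnsemblID);
my $HGNC = $Ensgene->external_name();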

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.0 years ago by Jean-Karim Heriche 27k

0

Thanks for clearing up the confusion. I still don't understand what the appended ".1", ".2", etc. signify. I thought they were gene names because I prepared the mapped list from a GTF file. Here is a line from the GTF file that refers to this symbol as the gene name.

GL000228.1      ENSEMBL gene    92463   94085   .       +       .       gene_id "ENSG00000256024.1"; transcript_id "ENSG00000256024.1"; gene_type "pseudogene"; gene_status "NOVEL"; gene_name "CT476828.6"; transcript_type "pseudogene"; transcript_status "NOVEL"; transcript_name "CT476828.6"; level 3;
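For reference, a mapping like the one in the question can be pulled out of the attribute column (the ninth tab-separated field) of each GTF record. The following is a minimal Python sketch, not the script I actually used: the function name `parse_gtf_attributes` and the regular expression are my own, and it assumes the simple `key "value";` or `key value;` layout seen in the line above.

```python
import re

# Matches either  key "value";  or  key value;  inside the attribute column.
ATTRIBUTE_RE = re.compile(r'(\w+)\s+"?([^";]+)"?;')


def parse_gtf_attributes(line):
    """Return the attribute column of one GTF record as a dict of strings."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF fields, got %d" % len(fields))
    return dict(ATTRIBUTE_RE.findall(fields[8]))


# The record quoted above, rebuilt with explicit tab separators.
record = "\t".join([
    "GL000228.1", "ENSEMBL", "gene", "92463", "94085", ".", "+", ".",
    'gene_id "ENSG00000256024.1"; transcript_id "ENSG00000256024.1"; '
    'gene_type "pseudogene"; gene_status "NOVEL"; gene_name "CT476828.6"; '
    'transcript_type "pseudogene"; transcript_status "NOVEL"; '
    'transcript_name "CT476828.6"; level 3;',
])

attrs = parse_gtf_attributes(record)
print(attrs["gene_id"], attrs["gene_name"])  # ENSG00000256024.1 CT476828.6
```

The `gene_name` attribute is simply whatever name the annotation assigned to the record, which is why it ended up in my list alongside real gene symbols.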

I have also edited the question to mention gene symbols.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by hirak.sarkar ▴ 20

0

The .1 after ENSG00000256024 is the version number. I can't really think of a use for it, because if the gene changes significantly it gets a new ID. Many tools also don't recognize it.
It looks like the CTxxxxx names correspond to novel non-coding transcripts and that the genes were named after the corresponding transcripts. Gene symbols are generally associated with well-characterized genes, so novel genes usually get some sort of ID as their name.
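If the versioned IDs get in the way, removing the suffix is a one-line string operation. Here is a minimal Python sketch, assuming the version is whatever follows the first dot in the ID; the function name `split_stable_id` is only illustrative.

```python
def split_stable_id(stable_id):
    """Split 'ENSG00000256024.1' into ('ENSG00000256024', '1').

    IDs that carry no dot come back with None as the version.
    """
    base, sep, version = stable_id.partition(".")
    return base, (version if sep else None)


gencode_ids = ["ENSG00000256024.1", "ENSG00000148828.5", "ENSG00000238720"]
unversioned = [split_stable_id(gene_id)[0] for gene_id in gencode_ids]
print(unversioned)  # ['ENSG00000256024', 'ENSG00000148828', 'ENSG00000238720']
```

This should only be applied to the Ensembl ID column. The dot suffix on names like CT476828.6 is part of the name, not a version, and has to stay.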

ADD REPLY • link 9.0 years ago by Jean-Karim Heriche 27k

0

I think my confusion comes from the distinction between novel genes and well known genes.

Thanks for the help!

ADD REPLY • link 9.0 years ago by hirak.sarkar ▴ 20