Convert Ensembl Gene IDs to Gene Symbols in R
2.2 years ago
Amr ▴ 180

I tried to convert Ensembl gene IDs to gene symbols using biomaRt and the annotation package org.Hs.eg.db in R, and also on the biotools website, but some genes were not converted to symbols.

Why were some genes not converted, and is there a better solution?

Thanks

2.2 years ago
Marco Pannone ▴ 810

Can you give an example? It might be pseudogenes or transcripts not currently annotated.

'ENSG00000276171', 'ENSG00000269227', 'ENSG00000280816', 'ENSG00000236269',
'ENSG00000182109', 'ENSG00000261254', 'ENSG00000278882', 'ENSG00000273689',
'ENSG00000277573', 'ENSG00000278939', 'ENSG00000252817', 'ENSG00000216109',
'ENSG00000280316', 'ENSG00000260766', 'ENSG00000239373'

If you search for some of them, you can see that these Ensembl IDs refer to transcripts from non-coding or unannotated regions.


So, they have no symbols, right?


So naturally, you are not going to see them associated with any "Gene Symbol" when doing the conversion from "Ensembl ID".


Novel genes used to be assigned temporary cryptic placeholder symbols like AC010680.1 or LINC02050 or C1orf43. They recently stopped doing that in favor of just using Ensembl IDs, since those symbols were not particularly helpful. There was a blog post somewhere about this, but I can't find it.


But how can I see their symbols in the unnormalized data? How were those symbols obtained?


I do not know why you are mentioning normalization now, since it has nothing to do with the ID of a transcript in your dataset. Intuitively, if a transcript comes from a coding region, you would expect it to also have a "Gene Symbol". If the transcript comes from a non-coding region, you would not expect it to map to any "Gene Symbol". Transcripts from non-coding regions still have an "Ensembl ID" (for example, see here how this is possible: https://www.ensembl.org/info/genome/genebuild/ncrna.html).

I tried to explain it as simply as I can, so I hope it is clear. I would still recommend doing some more reading, because these are fairly basic concepts.

2.2 years ago
ngarber ▴ 60

The biomaRt package is the best way to do it, I believe, although I personally am using an equivalent package in Python. However, the principle is the same:

Some genes don't have names, especially if they're newly predicted, and you just have to identify them with their Ensembl Gene ID (ENSG#). Occasionally, if you google around, you might find a name for some of them, but they aren't in the BioMart database.

Some of the unnamed ones may receive names in future releases of BioMart.
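Since I mentioned Python, here is a minimal sketch of that principle: use the symbol when one exists, otherwise keep the Ensembl gene ID as the label. The `SYMBOLS` table is only a tiny stand-in for a real annotation lookup (for example a table exported from BioMart), and `parse_ids` and `label_genes` are hypothetical helper names, not part of any package. The two IDs taken from the list posted above are simply absent from the toy table, which is how an ID without a symbol would look after a lookup.

```python
import re

# Toy stand-in for a real annotation table (assumption: in practice this
# would come from a BioMart export or a similar annotation source).
SYMBOLS = {
    "ENSG00000141510": "TP53",
    "ENSG00000012048": "BRCA1",
    "ENSG00000139618": "BRCA2",
}

# Ensembl human gene IDs: "ENSG" followed by 11 digits.
ENSG_PATTERN = re.compile(r"ENSG\d{11}")


def parse_ids(text):
    """Pull Ensembl gene IDs out of free text, e.g. a quoted, comma-separated list."""
    return ENSG_PATTERN.findall(text)


def label_genes(ids, symbols):
    """Map each ID to its symbol, falling back to the ID itself.

    Returns (labels, unmapped). A missing key and an empty-string symbol
    are both treated as "no symbol".
    """
    labels = {}
    unmapped = []
    for gene_id in ids:
        symbol = symbols.get(gene_id)
        if symbol:
            labels[gene_id] = symbol
        else:
            labels[gene_id] = gene_id  # keep the Ensembl ID as the label
            unmapped.append(gene_id)
    return labels, unmapped


pasted = "ENSG00000141510','ENSG00000276171','ENSG00000139618','ENSG00000269227"
labels, unmapped = label_genes(parse_ids(pasted), SYMBOLS)
print(labels)
print("No symbol found for:", unmapped)
```

Keeping the list of unmapped IDs separate makes it easy to check them by hand later, or to retry them against a newer annotation release.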
