Problem converting Ensembl IDs to gene symbols in TCGA data
6.5 years ago

Dear all,

I am working on read count files obtained from TCGA. Each read count file contains read counts for 60,483 Ensembl IDs.

I used BioMart to convert the Ensembl IDs to gene symbols. However, I found the equivalent gene symbols for only around 19,000 of them.

Strangely some EnsemblIDs are not recognized by Ensembl itself.

I have a list of differentially expressed genes in which around 80% of the genes have only an Ensembl ID and no name.

Since I want to use different enrichment or pathway databases, this is a problem: these IDs are not recognized by those databases.

Can you advise me how to tackle this problem?

Nazanin

RNA-Seq TCGA read counts ensembl • 5.8k views

Hello Nazanin,

Could you show us some of the IDs that cannot be converted to a gene symbol?

fin swimmer


Hi,

Sure.

Here are some of my IDs:

ENSG00000011465.15 ENSG00000012223.11 ENSG00000016402.11 ENSG00000034971.13 ENSG00000050767.14 ENSG00000063127.14 ENSG00000064270.11 ENSG00000066382.15


Hello nazaninhoseinkhan,

There are two ways to get the gene names:

  • Go to Ensembl's hg19 BioMart and query there without the version number of the gene ID
  • As tujuchuanli said, download the annotation file from the page he/she linked and use grep/awk to find the gene names

fin swimmer


nazaninhoseinkhan: You can go to the current BioMart from the main Ensembl page (no need to go to the hg19 BioMart) and search for the gene IDs without the version numbers.


Yes, with some luck this works. But it wouldn't surprise me if the current Ensembl release has dropped some genes from the former version, or if the official gene symbol has changed.

So I think it is always a better idea to use the same reference assembly in each step. I suggested hg19 here because the version numbers from the examples above are from hg19.

fin swimmer


Ensembl was supposed to have redirects in place for stale Ensembl IDs. This was acknowledged as a problem in a discussion here (possibly before you joined). That fix may have been implemented already.


You could give the Ensembl ID converter a try.


Hi,

Unfortunately, when I used the Ensembl ID converter, I got this message: "no stable IDs could be mapped"

6.5 years ago
Erin Haskell ▴ 470

Hi there,

TL;DR - remove the versioning e.g. ENSG00000011465.15 -> ENSG00000011465

With BioMart you need to choose the correct filter when filtering by a list of IDs. There are two relevant options: 'Gene stable ID(s)' and 'Gene stable ID(s) with version'. Your IDs carry the version, e.g. ENSG00000063127.14 rather than just ENSG00000063127. If you use version 92 of BioMart with the versioned filter, it will only give you results for IDs whose version is the current one in Ensembl 92. I would therefore advise that you strip the version suffix from all of the IDs (remove the . and the numbers following it).

This is also important if you want to use the Ensembl ID converter: you need to remove the versioning from the IDs. Here's an example using your IDs that haven't worked. You can see that none of these is the current version in Ensembl version 92.
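The version stripping described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `strip_version` is my own, and it assumes the version is always a trailing dot followed by digits, as in the IDs posted in this thread.

```python
import re

# Matches a trailing ".<digits>" version suffix, e.g. ".15" in ENSG00000011465.15.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(ensembl_id: str) -> str:
    """Return the stable ID without its version suffix (hypothetical helper)."""
    return _VERSION_SUFFIX.sub("", ensembl_id.strip())


# Three of the IDs posted earlier in the thread.
ids = ["ENSG00000011465.15", "ENSG00000012223.11", "ENSG00000063127.14"]
stable_ids = [strip_version(i) for i in ids]
print(stable_ids)
```

The resulting unversioned IDs are what you would paste into the 'Gene stable ID(s)' filter or the ID converter.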


Sorry - forgot to mention that you can also use the REST API xref ID endpoint to access this information. You can supply a list of ENSG IDs (without version numbers) and specify the database. If you're looking for gene names/symbols you can specify HGNC as the database.
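As a rough sketch of what such a request could look like from Python, using only the standard library: the server address `https://rest.ensembl.org`, the `xrefs/id/` path, the `external_db` parameter and the `display_id` response field are my assumptions about the public REST service, so please check them against the REST documentation. The helper names are made up, and this version sends one request per ID rather than a list.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of the public Ensembl REST server.
REST_SERVER = "https://rest.ensembl.org"


def xref_url(stable_id: str, external_db: str = "HGNC") -> str:
    """Build the xrefs/id URL for one unversioned stable ID (hypothetical helper)."""
    query = urllib.parse.urlencode({"external_db": external_db})
    return f"{REST_SERVER}/xrefs/id/{urllib.parse.quote(stable_id)}?{query}"


def symbols_from_xrefs(xrefs: list) -> list:
    """Collect the 'display_id' values (assumed to hold the symbol) from a decoded response."""
    return [entry["display_id"] for entry in xrefs if "display_id" in entry]


def fetch_symbols(stable_id: str) -> list:
    """Query the endpoint for one ID; needs network access."""
    request = urllib.request.Request(
        xref_url(stable_id), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return symbols_from_xrefs(json.load(response))
```

Calling `fetch_symbols("ENSG00000011465")` would then return whatever HGNC cross-references the server reports for that gene. For a full TCGA gene list, loop over the unversioned IDs and pause between requests.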


Hi Erin,

Thank you so much.

I removed the versions from the IDs and most of the IDs were mapped.

I also thank everyone who helped me with this post: tujuchuanli, genomax and finswimmer.


Agreed. I was using TCGA gene expression quantification files (60,000 genes) and GTEx v8 data from a specific tissue.

I tried matching on gene symbol, but was left with 20,000 genes missing after joining.

I then tried dropping the .# version suffix from the gene ID and was left with only 5,000 genes missing after joining. However, there were 43 duplicate gene IDs after I trimmed the suffix.
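If you want to see which IDs collide after trimming before you join anything, a small check like the following may help. It is only a sketch: the function name is my own and the IDs in the example are made up, not taken from TCGA or GTEx.

```python
from collections import Counter


def duplicated_after_trim(versioned_ids):
    """Return the stable IDs that occur more than once after the suffix is dropped."""
    # Keep only the part before the first ".", which removes the version suffix.
    trimmed = [gene_id.split(".", 1)[0] for gene_id in versioned_ids]
    counts = Counter(trimmed)
    return sorted(stable for stable, n in counts.items() if n > 1)


# Made-up IDs for illustration only.
example = ["ENSG00000000001.1", "ENSG00000000001.2", "ENSG00000000002.5"]
print(duplicated_after_trim(example))
```

Looking at the original, untrimmed rows for the IDs this returns lets you decide whether to sum their counts, keep one of them, or drop them before joining.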

6.5 years ago
tujuchuanli ▴ 130

The annotation used by TCGA is GENCODE v22, and it can be downloaded from https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files
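As an alternative to grep/awk, the gene names can be pulled out of the annotation file with a short Python script. This is a sketch that assumes the usual GTF layout (nine tab-separated columns, with `gene_id "..."` and `gene_name "..."` in the last one). The function name is my own and the sample rows are made up, not copied from the real file.

```python
import re

# Attribute patterns in the ninth GTF column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gene_names_from_gtf(lines):
    """Map versioned gene_id -> gene_name using only the 'gene' rows of a GTF."""
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # skip malformed rows and transcript/exon rows
        gene_id = GENE_ID.search(fields[8])
        gene_name = GENE_NAME.search(fields[8])
        if gene_id and gene_name:
            mapping[gene_id.group(1)] = gene_name.group(1)
    return mapping


# Made-up GTF rows for illustration only.
sample = [
    "##description: example\n",
    'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG00000000001.1"; gene_type "protein_coding"; gene_name "GENE_A";\n',
    'chr1\tHAVANA\texon\t100\t150\t.\t+\t.\tgene_id "ENSG00000000001.1"; gene_name "GENE_A";\n',
]
print(gene_names_from_gtf(sample))
```

For the real file you would pass an open file handle instead of the sample list (with `gzip.open(path, "rt")` if it is compressed). Because the keys keep their version suffix, they can be matched directly against the IDs in the TCGA count files.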
