The "major" problem of gene identifier conversion
9.1 years ago
Abdullah ▴ 100

Hi,

I know this question has been asked multiple times, but none of the answers given was "satisfactory".

I managed to get a list of genes for each KEGG pathway using the kg tool (problem with KEGG pathway gene extraction). However, when I try to convert this list to another identifier type, a big problem arises.

Since I want an automatic approach, and following the suggestions in the mentioned question, I decided to use the Python wrapper of MyGene.info:

import mygene

mg = mygene.MyGeneInfo()

allGeneSymbols = ["DP2", "DP1", "MAD3L"]

out = mg.querymany(allGeneSymbols, scopes='symbol', fields='entrezgene', species='human')

It worked for only a small set of genes, and the problem seems to be the naming. For example, one of the genes for which no conversion could be achieved is called DP2 in the KEGG list. However, when I dug a bit more, I was able to find this gene in MyGene.info using http://mygene.info/v2/query?q=DP2, where it is named "TFDP2":

{"hits": [{"symbol": "PTGDR2", "_id": "11251", "entrezgene": 11251, "_score": 0.7157431, "name": "prostaglandin D2 receptor 2", "taxid": 9606}, {"symbol": "TFDP2", "_id": "7029", "entrezgene": 7029, "_score": 0.6262752, "name": "transcription factor Dp-2 (E2F dimerization partner 2)", "taxid": 9606}, {"symbol": "APC", "_id": "324", "entrezgene": 324, "_score": 0.58416, "name": "adenomatous polyposis coli", "taxid": 9606}], "max_score": 0.7157431, "took": 4, "total": 3}

which shows why it has not been found using the python script!
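One possible way to pick up old names such as DP2 is to search alias fields as well as the official symbol, and then separate unique hits from ambiguous ones. The sketch below is only that, a sketch: it assumes `alias` is accepted as a `querymany` scope, and the helper names `group_hits` and `query_symbols_and_aliases` are made up for illustration. The sample hits are copied from the JSON output above, so the grouping step can be tried offline.

```python
from collections import defaultdict


def group_hits(hits):
    """Split querymany-style hits into resolved, ambiguous and missing queries."""
    by_query = defaultdict(list)
    missing = []
    for hit in hits:
        if hit.get("notfound"):
            missing.append(hit["query"])
        elif "entrezgene" in hit:
            by_query[hit["query"]].append(
                (hit.get("symbol", ""), str(hit["entrezgene"]))
            )
    # Exactly one candidate: safe to convert automatically.
    resolved = {q: c[0] for q, c in by_query.items() if len(c) == 1}
    # Several candidates: the name alone cannot tell the genes apart.
    ambiguous = {q: c for q, c in by_query.items() if len(c) > 1}
    return resolved, ambiguous, missing


def query_symbols_and_aliases(symbols):
    """Network call (not run here); assumes 'alias' is a valid scope."""
    import mygene  # imported lazily so group_hits works without the package

    mg = mygene.MyGeneInfo()
    return mg.querymany(
        symbols,
        scopes="symbol,alias",
        fields="entrezgene,symbol",
        species="human",
    )


# Offline sample built from the hits shown above.
sample_hits = [
    {"query": "DP2", "symbol": "PTGDR2", "entrezgene": 11251},
    {"query": "DP2", "symbol": "TFDP2", "entrezgene": 7029},
    {"query": "APC", "symbol": "APC", "entrezgene": 324},
    {"query": "MAD3L", "notfound": True},  # illustrative "not found" entry
]
resolved, ambiguous, missing = group_hits(sample_hits)
```

Anything that ends up in `ambiguous` still needs a manual decision or, better, a database ID from KEGG itself.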

Any suggestions on a better way to handle such a problem? One option would be to get the JSON output with curl and post-process it (not the best way). Another option would be to use Reactome, but this would require rewriting everything to deal with the Reactome hierarchy, get the genes and so on (unless some tool already exists to do this).

EDIT:

One more way I found to get all the KEGG genes (Entrez IDs) is to download the data from GSEA (e.g., http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/5.0/c2.cp.kegg.v5.0.entrez.gmt). However, building the KEGG hierarchy from this file is simply not possible, so this does not solve my problem.

gene • 4.1k views

Does this mean that you were not able to convert all the gene names from KEGG even using mygene.info? (I am currently investigating the same problem, but have found no satisfactory solutions.)


Exactly. I could not convert all the KEGG genes even using mygene.info. I showed an example where I know why the conversion did not work.

9.1 years ago

There are several options to get the Entrez gene IDs for genes in a KEGG pathway:


Are the KEGG identifiers that look like "hsa:401105" just entrezgenes then (with a species prefix)?


Yes. In hsa:401105, 401105 is the Entrez gene ID.
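As a rough illustration of that point, the prefix can simply be split off. The sketch below also parses a two-column, tab-separated pathway-to-gene table. It assumes the KEGG REST `link` operation returns lines of the form `path:hsa04110<TAB>hsa:1234`, and the function names are hypothetical. The numeric check only makes sense for organisms such as human, where the KEGG gene ID is the Entrez gene ID.

```python
from urllib.request import urlopen


def entrez_from_kegg(kegg_gene_id):
    """Split 'hsa:401105' into ('hsa', '401105')."""
    organism, sep, entrez_id = kegg_gene_id.partition(":")
    if not sep or not entrez_id.isdigit():
        raise ValueError(f"not an '<organism>:<number>' identifier: {kegg_gene_id!r}")
    return organism, entrez_id


def parse_link_table(text):
    """Turn 'pathway<TAB>gene' lines into {pathway: [Entrez IDs]}."""
    pathways = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pathway, gene = fields[0], fields[1]
        pathways.setdefault(pathway, []).append(entrez_from_kegg(gene)[1])
    return pathways


def fetch_pathway_genes(pathway_id):
    """Network call (not run here); the URL pattern is an assumption."""
    url = f"https://rest.kegg.jp/link/hsa/{pathway_id}"
    with urlopen(url) as response:
        return parse_link_table(response.read().decode())


# Offline example; 'path:hsa00000' is a placeholder pathway ID.
sample_table = "path:hsa00000\thsa:401105\n"
genes_by_pathway = parse_link_table(sample_table)
```

Taking the IDs straight from KEGG in this way avoids the symbol lookup entirely.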


Thanks for your patient help.


We will be waiting for an update on the kg tool :)

9.1 years ago
Fidel ★ 2.0k

Have you considered using another pathway database? I find Reactome to be better maintained and more informative in general (see How Do Pathway Databases Compare?). You can easily browse pathways and get UniProt protein identifiers, which can easily be converted to other identifiers. They also have a nice tool for converting identifiers and finding pathway enrichments.


Well, the Reactome data does not make much sense to me. For example, if you look at this file (http://www.reactome.org/download/current/Ensembl2Reactome_All_Levels.txt), which should "supposedly" contain the mapping between Ensembl IDs and pathways on all levels, you find only 1423 entries for Homo sapiens and only 321 unique Ensembl IDs (which is too little data, unless I'm getting something completely wrong).

After digging a bit more into Reactome, I was able to find this non-public file (http://www.reactome.org/download/current/homo_sapiens_ensembl_gene_to_pathways.csv), which also "supposedly" contains the mapping between human Ensembl IDs and pathways. This file contains 7126 unique Ensembl IDs, which makes more sense. If this is correct, one needs to rebuild the hierarchy of Reactome pathways using http://www.reactome.org/download/current/ReactomePathwaysRelation.txt.

So here are two contradictory outputs. I assume the second one is correct, but who knows.
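Rebuilding the hierarchy from the relation file mentioned above could look roughly like the sketch below. It assumes each line of ReactomePathwaysRelation.txt holds a parent pathway ID and a child pathway ID separated by a tab; that layout is an assumption worth checking against the actual file. The pathway IDs in the sample are placeholders, not real Reactome entries.

```python
from collections import defaultdict


def parse_relations(text):
    """Read 'parent<TAB>child' lines into a child -> {parents} mapping."""
    parents = defaultdict(set)
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        parent, child = (field.strip() for field in fields)
        parents[child].add(parent)
    return parents


def ancestors(pathway, parents):
    """Collect every pathway above `pathway` in the hierarchy."""
    seen = set()
    stack = [pathway]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Placeholder IDs: TOP contains MID, and MID contains LEAF.
sample_relations = "R-HSA-TOP\tR-HSA-MID\nR-HSA-MID\tR-HSA-LEAF\n"
relation_map = parse_relations(sample_relations)
leaf_ancestors = ancestors("R-HSA-LEAF", relation_map)
```

With such a mapping, a gene annotated to a low-level pathway can be propagated to all of its parent pathways.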


I looked at the UniProt2Reactome.txt file, which contains 8723 unique UniProt identifiers. The Ensembl2Reactome file seems truncated, or maybe they don't have many mappings to Ensembl, but UniProt should be the primary and more reliable identifier, which you can map to other identifiers, for example using bioDBnet.

You may want to contact Reactome directly; they are very helpful (help@reactome.org).

9.1 years ago

Use database gene identifiers; they are more stable than gene names or gene symbols. Names and symbols change over time and, unlike database IDs, these changes are not tracked.


Can you please expand upon what a database gene identifier is and how the asker can use them to solve his or her problem?

Is it just the field called _id above? Still, that won't help the conversion afaics.


I mean IDs from public databases like Ensembl (e.g. ENSG00000178999) or NCBI's Entrez Gene (e.g. 11251). Since KEGG references genes using Entrez gene IDs, one should retrieve these IDs from KEGG (along with the symbols/names if needed) and use them for conversion.

In the example above, querying with DP2 returns two entries because this name was used for two genes, which are now named TFDP2 and PTGDR2, so the only way to disambiguate is to use a database ID or accession number.

9.1 years ago
Abdullah ▴ 100

I came up with a weird solution to my problem. Not sure if this would help others.

What I will do is the following:

  1. Download the GSEA KEGG lists (Entrez IDs) from here. However, those contain only pathway names, which makes it hard to build the hierarchy.

  2. To fix this, I will curl the path that is found inside this file for each pathway, e.g.,

    curl http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_O_GLYCAN_BIOSYNTHESIS | grep -A1 'External links'
    

    and get the field: External Links which contains the KEGG ID of this pathway. Using this, I can have all the KEGG pathway IDs along with their corresponding genes.

  3. Using the pathway IDs, I can build the hierarchy using this BRITE hierarchy file: http://www.genome.jp/kegg-bin/download_htext?htext=br08901.keg&format=htext&filedir=
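The parsing side of the steps above can be sketched as follows. This is a minimal sketch under two assumptions: that each GMT line is `name<TAB>description<TAB>gene1<TAB>gene2...`, and that the htext file marks hierarchy levels with a leading `A`, `B` or `C`, with `C` lines holding a pathway number followed by its name. The function names are hypothetical, and the sample strings are made up to show the shape of the data (the gene IDs in particular are placeholders).

```python
import re


def parse_gmt(text):
    """Map gene-set name -> (description, [gene IDs]) from GMT-formatted text."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # need a name, a description and at least one gene
        name, description = fields[0], fields[1]
        gene_sets[name] = (description, [gene for gene in fields[2:] if gene])
    return gene_sets


def parse_htext(text):
    """Map pathway number -> (class, subclass, name) from A/B/C-level lines."""
    hierarchy = {}
    top = sub = None
    for line in text.splitlines():
        if not line or line[0] not in "ABC":
            continue  # header and separator lines start with other characters
        label = re.sub(r"<[^>]+>", "", line[1:]).strip()  # drop any HTML tags
        if not label:
            continue
        if line[0] == "A":
            top, sub = label, None
        elif line[0] == "B":
            sub = label
        else:
            pathway_id, _, name = label.partition(" ")
            hierarchy[pathway_id] = (top, sub, name.strip())
    return hierarchy


# Made-up samples in the assumed formats; gene IDs 1 and 2 are placeholders.
sample_gmt = (
    "KEGG_O_GLYCAN_BIOSYNTHESIS\t"
    "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_O_GLYCAN_BIOSYNTHESIS"
    "\t1\t2\n"
)
sample_htext = (
    "AMetabolism\n"
    "B  Glycan biosynthesis and metabolism\n"
    "C    00512  Mucin type O-glycan biosynthesis\n"
)
gene_sets = parse_gmt(sample_gmt)
hierarchy = parse_htext(sample_htext)
```

The link between the two tables is the KEGG pathway ID obtained in step 2, which is not part of the GMT file itself.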
