Hi,
I know this question has been asked multiple times, but none of the answers given were satisfactory.
I managed to get a list of genes for each KEGG pathway using the kg tool (problem with KEGG pathway gene extraction). However, when I try to convert this list to another identifier type, a big problem arises.
Since I want an automated approach, and following the suggestions in the question mentioned above, I decided to use the Python wrapper for MyGene.info:
import mygene
mg = mygene.MyGeneInfo()
allGeneSymbols = ["DP2", "DP1", "MAD3L"]
out = mg.querymany(allGeneSymbols, scopes='symbol', fields='entrezgene', species='human')
It worked for only a small subset of the genes, and the problem seems to be the naming. For example, one of the genes that could not be converted is called DP2 in the KEGG list. However, when I dug a bit deeper, I was able to find this gene in MyGene.info using http://mygene.info/v2/query?q=DP2, where its official symbol is "TFDP2":
{"hits": [{"symbol": "PTGDR2", "_id": "11251", "entrezgene": 11251, "_score": 0.7157431, "name": "prostaglandin D2 receptor 2", "taxid": 9606}, {"symbol": "TFDP2", "_id": "7029", "entrezgene": 7029, "_score": 0.6262752, "name": "transcription factor Dp-2 (E2F dimerization partner 2)", "taxid": 9606}, {"symbol": "APC", "_id": "324", "entrezgene": 324, "_score": 0.58416, "name": "adenomatous polyposis coli", "taxid": 9606}], "max_score": 0.7157431, "took": 4, "total": 3}
which shows why the Python script did not find it: with scopes='symbol', only the official symbol (TFDP2) is matched, not the KEGG name DP2.
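One possible workaround for the DP2 case is to let querymany search alias names as well as official symbols. The sketch below assumes that the KEGG names are stored as aliases in MyGene.info, which may not hold for every gene. The helper names group_hits and convert_symbols are made up for illustration. An alias can match more than one gene, so each input name maps to a list of candidates rather than a single ID.

```python
from collections import defaultdict


def group_hits(results):
    """Group querymany() result dicts by the input term they belong to.

    Returns a dict mapping each query term to a list of
    (symbol, entrezgene) tuples. Records flagged as not found, or
    lacking an Entrez ID, are skipped.
    """
    hits = defaultdict(list)
    for rec in results:
        if not rec.get("notfound") and "entrezgene" in rec:
            # Entrez IDs may come back as int or str; normalise to str.
            hits[rec["query"]].append((rec.get("symbol"), str(rec["entrezgene"])))
    return dict(hits)


def convert_symbols(symbols):
    """Look up names as official symbols or aliases (needs network access)."""
    import mygene  # imported here so group_hits() can be used without it

    mg = mygene.MyGeneInfo()
    results = mg.querymany(
        symbols,
        scopes="symbol,alias",       # also search the alias field
        fields="symbol,entrezgene",
        species="human",
    )
    hits = group_hits(results)
    missing = [s for s in symbols if s not in hits]
    return hits, missing
```

Calling convert_symbols(["DP2", "DP1", "MAD3L"]) would return the candidate genes per name plus the names that still have no match. Names with several candidates need a manual or rule-based choice.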
Any suggestions for a better way to handle this problem? One option would be to fetch the JSON output with curl and post-process it (not ideal). Another option would be to use Reactome, but that would require rewriting everything to deal with the Reactome hierarchy, extract the genes, and so on (unless some tool already exists to do this).
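For the JSON option, curl is not strictly needed, since the same request can be made from Python. The following is a minimal sketch using only the standard library and the query URL quoted above. The function names parse_hits and query_gene are hypothetical, and the field names are taken from the JSON response shown earlier.

```python
import json
import urllib.parse
import urllib.request

QUERY_URL = "http://mygene.info/v2/query"  # endpoint quoted in the post


def parse_hits(json_text, taxid=9606):
    """Extract (symbol, entrezgene, score) tuples for one species.

    Hits from other species, or without an Entrez ID, are dropped.
    The order of the hits in the response is preserved.
    """
    data = json.loads(json_text)
    return [
        (hit.get("symbol"), hit["entrezgene"], hit.get("_score"))
        for hit in data.get("hits", [])
        if hit.get("taxid") == taxid and "entrezgene" in hit
    ]


def query_gene(term):
    """Fetch and parse the hits for one search term (needs network access)."""
    url = QUERY_URL + "?" + urllib.parse.urlencode({"q": term})
    with urllib.request.urlopen(url) as response:
        return parse_hits(response.read().decode("utf-8"))
```

Note that in the DP2 response above the top-scoring hit is PTGDR2, not TFDP2, so simply taking the first hit would give the wrong gene here. Some extra rule for choosing among the hits is still required.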
EDIT:
One more way I found to get all the KEGG genes (as Entrez IDs) is to download the data from GSEA (e.g., http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/5.0/c2.cp.kegg.v5.0.entrez.gmt). However, rebuilding the KEGG hierarchy from this file is simply not possible, so it does not solve my problem.
Does this mean that you were not able to convert all the gene names from KEGG, even using mygene.info? (I am currently investigating the same problem, but have found no satisfactory solutions.)
Exactly. I could not convert all the KEGG genes, even using mygene.info. I showed an example where I know why the conversion did not work.
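For reference, reading such a GMT file is straightforward. Each line holds a gene set name, a description field, and then the gene IDs, all separated by tabs. The sketch below (parse_gmt is a made-up name) returns a flat mapping from gene set name to Entrez IDs. It does not recover any hierarchy, because the file does not contain one.

```python
def parse_gmt(lines):
    """Parse GMT lines into a dict of gene set name -> list of gene IDs.

    Each line is: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    Lines with fewer than three fields are skipped.
    """
    pathways = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue
        name, _description, *genes = fields
        pathways[name] = [g for g in genes if g]  # drop empty trailing fields
    return pathways
```

With the downloaded file, this could be used as parse_gmt(open("c2.cp.kegg.v5.0.entrez.gmt")), assuming the file was saved under that name.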