I have been struggling to extract the set of genes related to each pathway from KEGG. I read multiple topics on biostar on how to do this but apparently there is something wrong. I downloaded the bret hierarchy file in order to get genes on the pathway level that I want. However, there seem to be multiple methods which some of them are working and the others are not.
The first method only has 401 lines for hsa05200, the latter method has close to 2000. My library uses the latter method. See https://github.com/endrebak/kg
My guess is that the first method is from the outdated kegg (it went unfunded for a while and the public data was not updated AFAIK), and the second is from the second run of KEGG, 2014-2016.
I'm able to get a list of genes by using the kg tool as shown below. Perhaps this is enough to meet your needs. I would also like to be able to specify the type of gene identifier, such as Entrez ID or Ensembl ID (link to issue), but this has not been implemented yet as far as I can tell.
Command:
kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo 01200)
Output: (trimmed)
Cache path is: /Users/slowikow/.kegg/ (Time: Thu, 29 Oct 2015 10:10:24 )
Get KEGG gene to external gene map. (Time: Thu, 29 Oct 2015 10:10:24 )
Get KEGG path to gene map. (Time: Thu, 29 Oct 2015 10:10:25 )
Get KEGG pathway to definition map. (Time: Thu, 29 Oct 2015 10:10:25 )
Connect KEGG gene map and KEGG pathway map. (Time: Thu, 29 Oct 2015 10:10:26 )
/Users/slowikow/anaconda/lib/python2.7/site-packages/ebs/merge_cols.py:74: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
in_df = in_df.sort(pre_merge_sort_order_col)
Attaching pathway definitions. (Time: Thu, 29 Oct 2015 10:10:26 )
01200 ME3 Carbon metabolism
01200 PGAMA Carbon metabolism
01200 PGAM-B Carbon metabolism
01200 HEL-S-35 Carbon metabolism
01200 PGAM1 Carbon metabolism
01200 PFKF Carbon metabolism
Unfortunately, the truth is more complicated than one might expect. Indeed, biomartian does not retrieve the Entrez IDs for all genes, but the genes do in fact have associated Entrez IDs.
PGAMA is 5223, but it is also known as PGAM1; PGAM-B; HEL-S-35
PGAM-B is 441531, but is also known as PGAM4; PGAM1; PGAM3; dJ1000K24.1
HEL-S-35 is probably 5223
PFKF is 5214, but is also known as PFKP; PFK-C; PFK-P; ATP-PFK
PFK-P is probably 5214
I prefer to use identifiers like Entrez or Ensembl for data analysis to avoid the ambiguity of gene names. On the other hand, I use HGNC names for presentation. Endre, you might like to visit MyGene.info if you have not already. I think the developers have done an admirable job of dealing with name ambiguity.
ADD REPLY
• link
updated 5.2 years ago by
Ram
44k
•
written 9.2 years ago by
Kamil
★
2.3k
0
Entering edit mode
Thanks for the info, I should switch to a different gene name format by default then. I think KEGGREST offers them, but the service is devoid of documentation.
Thanks. When I run the API example, I get this error:
>>>df= get_kegg_data("rno", pathway_definitions)>>> Traceback (most recent call last):
File "<stdin>", line 1, in<module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 483, in __call__
return self._cached_call(args, kwargs)[0]
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 430, in _cached_call
out, metadata = self.call(*args, **kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 675, in call
output = self.func(*args, **kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kegg/lib.py", line 32, in get_kegg_data
"kegg_pathway")
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ebs/merge_cols.py", line 35, in attach_data
merged_df = _post_merge_cleanup(merged_df, new_col_name)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ebs/merge_cols.py", line 74, in _post_merge_cleanup
in_df = in_df.sort_values_by(pre_merge_sort_order_col)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 2246, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'sort_values_by'
That is something on your end. The CLI and API uses the same code. It probably stems from you having multiple versions of Python installed and different versions of the library installed for each. If you encounter more problems please tell me at https://github.com/endrebak/kg/issues
The first method only has 401 lines for hsa05200, the latter method has close to 2000. My library uses the latter method. See https://github.com/endrebak/kg
My guess is that the first method is from the outdated kegg (it went unfunded for a while and the public data was not updated AFAIK), and the second is from the second run of KEGG, 2014-2016.