Question

Mapping Pathways To Genes

1

Entering edit mode

10.7 years ago

juncheng ▴ 220

Hi!

I want to assign pathways to a gene (protein) list like this:

gene1   path1 path2
gene2   path3 path4 path5 path6
gene3   path7 path8 path9

But what I got from kegg mapper is like this:

enterpath1   gene1 gene2
path2   gene3 gene4 gene5 gene6
path3   gene7 gene8 gene9 code here

I cannot find a way to do this from kegg mapper, I think I need a script to do this. From the KEGGREST package of Bioconductor it seams possible but low efficient.

kegg pathway • 3.8k views

ADD COMMENT • link updated 10.7 years ago by Neilfws 49k • written 10.7 years ago by juncheng ▴ 220

1

Entering edit mode

What IDs do you have for gene1, gene2 etc. ? Are they KEGG gene IDs, such as eco:b0002?

ADD REPLY • link 10.7 years ago by Neilfws 49k

0

Entering edit mode

see Get all the Genes in a Pathway (e.g. KEGG)

ADD REPLY • link 10.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

yes, I read that before, but what I want is exactly he does not want, I want list of gene and which pathway it envoles.

ADD REPLY • link 10.7 years ago by juncheng ▴ 220

0

Entering edit mode

Using bash, make a for loop of unique gene names, and grep rows where gene name matches, get first column...

ADD REPLY • link 10.7 years ago by zx8754 12k

score 4 · Answer 1 · 2014-04-01

4

Entering edit mode

10.7 years ago

Neilfws 49k

You don't say what kind of gene IDs you have. If you have, or can convert to KEGG gene IDs (such as eco:b0002), genes -> pathways is easy using the TogoWS REST service.

curl http://togows.dbcls.jp/entry/genes/eco:b0002/pathways.json
[
  {
    "eco00260": "Glycine, serine and threonine metabolism",
    "eco00270": "Cysteine and methionine metabolism",
    "eco00300": "Lysine biosynthesis",
    "eco01100": "Metabolic pathways",
    "eco01110": "Biosynthesis of secondary metabolites",
    "eco01120": "Microbial metabolism in diverse environments",
    "eco01230": "Biosynthesis of amino acids"
  }
]

ADD COMMENT • link 10.7 years ago by Neilfws 49k

0

Entering edit mode

cool, I'm trying to use it.

ADD REPLY • link 10.7 years ago by juncheng ▴ 220

0

Entering edit mode

I tried this, and it seams super for a short list of genes, isn't it? But I have a list of near 1000 genes, does there anyway to do this in a "large scale"? Thanks again.

ADD REPLY • link 10.7 years ago by juncheng ▴ 220

0

Entering edit mode

Not sure about batch retrieval - probably not possible. You could always write a loop I suppose...but there's surely a quicker way.

ADD REPLY • link 10.7 years ago by Neilfws 49k

score 2 · Answer 2 · 2014-04-01

Here's a cleaner version that assumes you've cleaned up the header and footer from the REST output.

Also, hopefully, this example is expressive enough so you can tweak it for future use.

import sys
#
# Handle i/o filenames
#
input_fname = sys.argv[1]
output_fname = sys.argv[2]
#
# Process data
#
data_dict = {}
for line in file(input_fname):
    line = line.strip()
    spl = map(lambda x: x.strip(), line.split() ) # Default split by space, clean extra whitespace(s)
    pathway = spl[0] # Pathway name
    genes = spl[1:] # List of genes
    for gene in genes:
        current_pathways = data_dict.get(gene, []) # Obtain existing list of pathways associated with gene
        current_pathways.append(pathway) # Append to existing list of pathways, if the list exists
        current_pathways = list(set(current_pathways)) # Make pathway list unique
        data_dict[gene] = current_pathways # Let's update the dictionary
#
# Flush output to file
#
ofile = file(output_fname, "w")
for gene, pathway_list in data_dict.iteritems():
    pathways = " ".join(pathway_list) # Space separated list of pathways, or your choice of delimiter
    ofile.write("%(gene)s\t%(pathways)s\n" % dict(pathways=pathways, gene=gene)) 
ofile.close()

score 1 · Answer 3 · 2014-04-01

1

Entering edit mode

10.7 years ago

Xingyu Yang ▴ 280

#!/use/bin/env python  
import sys  
paths={}
for l in sys.stdin:
    p=l.rstrip().split()
    for i in range(1,len(p)):
        if p[i] not in paths:paths[p[i]]=[]
        paths[p[i]].append(p[0])
for k in paths:
    print(k,'\t'.join(paths[k]),sep='\t')

I didn't bebug the above script, but I think it would work. I know there might be some indent issues if you just copy the python script from Biostars.

usage:

cat input | python thisscript.py

ADD COMMENT • link 10.7 years ago by Xingyu Yang ▴ 280

0

Entering edit mode

Thanks, see whether it works.

ADD REPLY • link 10.7 years ago by juncheng ▴ 220

score 1 · Answer 4 · 2014-04-01

I tried it using Perl Hashes of arrays.

    open FH,"input.txt";
    %HoA=();
    print "___________DATA_________\n";
    while(<FH>)
    {
    print "$_";
        @array=split;
        for($i=1;$i<=$#array;$i++)
        {
        push @{ $HoA{$array[$i]} }, "$array[0]";
        }
    }
    print "___________RESULT_________\n";
    foreach $k (sort keys %HoA)
    {
        print "$k\t@{$HoA{$k}}\n";
   }

OUTPUT

    ________DATA_________
        path1   gene1 gene2
        path2   gene3 gene4 gene5 gene6
        path3   gene7 gene8 gene9
        path4   gene1
 ___________RESULT_________
        gene1    path1 path4
        gene2    path1
        gene3    path2
        gene4    path2
        gene5    path2
        gene6    path2
        gene7    path3
        gene8    path3
        gene9    path3

score 0 · Answer 5 · 2014-04-03

Another option: the R/Bioconductor package KEGGREST. See the documentation for some examples.

It looks like this one allows multiple gene queries:

library("KEGGREST")
keggLink("pathway", c("eco:b0002", "eco:b0003"))

#        eco:b0002       eco:b0002       eco:b0002       eco:b0002       eco:b0002 
# "path:eco00260" "path:eco00270" "path:eco00300" "path:eco01100" "path:eco01110" 
#       eco:b0002       eco:b0002       eco:b0003       eco:b0003       eco:b0003 
# "path:eco01120" "path:eco01230" "path:eco00260" "path:eco01100" "path:eco01120" 
#       eco:b0003 
# "path:eco01230"