Mapping Pathways To Genes
5
1
Entering edit mode
10.7 years ago
juncheng ▴ 220

Hi!

I want to assign pathways to a gene (protein) list like this:

gene1   path1 path2
gene2   path3 path4 path5 path6
gene3   path7 path8 path9

But what I got from kegg mapper is like this:

enterpath1   gene1 gene2
path2   gene3 gene4 gene5 gene6
path3   gene7 gene8 gene9 code here

I cannot find a way to do this from kegg mapper, I think I need a script to do this. From the KEGGREST package of Bioconductor it seams possible but low efficient.

kegg pathway • 3.8k views
ADD COMMENT
1
Entering edit mode

What IDs do you have for gene1, gene2 etc. ? Are they KEGG gene IDs, such as eco:b0002?

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

yes, I read that before, but what I want is exactly he does not want, I want list of gene and which pathway it envoles.

ADD REPLY
0
Entering edit mode

Using bash, make a for loop of unique gene names, and grep rows where gene name matches, get first column...

ADD REPLY
4
Entering edit mode
10.7 years ago
Neilfws 49k

You don't say what kind of gene IDs you have. If you have, or can convert to KEGG gene IDs (such as eco:b0002), genes -> pathways is easy using the TogoWS REST service.

curl http://togows.dbcls.jp/entry/genes/eco:b0002/pathways.json
[
  {
    "eco00260": "Glycine, serine and threonine metabolism",
    "eco00270": "Cysteine and methionine metabolism",
    "eco00300": "Lysine biosynthesis",
    "eco01100": "Metabolic pathways",
    "eco01110": "Biosynthesis of secondary metabolites",
    "eco01120": "Microbial metabolism in diverse environments",
    "eco01230": "Biosynthesis of amino acids"
  }
]
ADD COMMENT
0
Entering edit mode

cool, I'm trying to use it.

ADD REPLY
0
Entering edit mode

I tried this, and it seams super for a short list of genes, isn't it? But I have a list of near 1000 genes, does there anyway to do this in a "large scale"? Thanks again.

ADD REPLY
0
Entering edit mode

Not sure about batch retrieval - probably not possible. You could always write a loop I suppose...but there's surely a quicker way.

ADD REPLY
2
Entering edit mode
10.7 years ago
paulr ▴ 80

Here's a cleaner version that assumes you've cleaned up the header and footer from the REST output.

Also, hopefully, this example is expressive enough so you can tweak it for future use.

import sys
#
# Handle i/o filenames
#
input_fname = sys.argv[1]
output_fname = sys.argv[2]
#
# Process data
#
data_dict = {}
for line in file(input_fname):
    line = line.strip()
    spl = map(lambda x: x.strip(), line.split() ) # Default split by space, clean extra whitespace(s)
    pathway = spl[0] # Pathway name
    genes = spl[1:] # List of genes
    for gene in genes:
        current_pathways = data_dict.get(gene, []) # Obtain existing list of pathways associated with gene
        current_pathways.append(pathway) # Append to existing list of pathways, if the list exists
        current_pathways = list(set(current_pathways)) # Make pathway list unique
        data_dict[gene] = current_pathways # Let's update the dictionary
#
# Flush output to file
#
ofile = file(output_fname, "w")
for gene, pathway_list in data_dict.iteritems():
    pathways = " ".join(pathway_list) # Space separated list of pathways, or your choice of delimiter
    ofile.write("%(gene)s\t%(pathways)s\n" % dict(pathways=pathways, gene=gene)) 
ofile.close()
ADD COMMENT
0
Entering edit mode

Thank, couldn't be better.

ADD REPLY
1
Entering edit mode
10.7 years ago
Xingyu Yang ▴ 280
#!/use/bin/env python  
import sys  
paths={}
for l in sys.stdin:
    p=l.rstrip().split()
    for i in range(1,len(p)):
        if p[i] not in paths:paths[p[i]]=[]
        paths[p[i]].append(p[0])
for k in paths:
    print(k,'\t'.join(paths[k]),sep='\t')

I didn't bebug the above script, but I think it would work. I know there might be some indent issues if you just copy the python script from Biostars.

usage:

cat input | python thisscript.py
ADD COMMENT
0
Entering edit mode

Thanks, see whether it works.

ADD REPLY
1
Entering edit mode
10.7 years ago
Prakki Rama ★ 2.7k

I tried it using Perl Hashes of arrays.

    open FH,"input.txt";
    %HoA=();
    print "___________DATA_________\n";
    while(<FH>)
    {
    print "$_";
        @array=split;
        for($i=1;$i<=$#array;$i++)
        {
        push @{ $HoA{$array[$i]} }, "$array[0]";
        }
    }
    print "___________RESULT_________\n";
    foreach $k (sort keys %HoA)
    {
        print "$k\t@{$HoA{$k}}\n";
   }

OUTPUT

    ________DATA_________
        path1   gene1 gene2
        path2   gene3 gene4 gene5 gene6
        path3   gene7 gene8 gene9
        path4   gene1
 ___________RESULT_________
        gene1    path1 path4
        gene2    path1
        gene3    path2
        gene4    path2
        gene5    path2
        gene6    path2
        gene7    path3
        gene8    path3
        gene9    path3
ADD COMMENT
0
Entering edit mode

Thanks, I thinks this helps a lot for people who also have this problem. Unfortunately I never used Perl.

ADD REPLY
0
Entering edit mode
10.7 years ago
Neilfws 49k

Another option: the R/Bioconductor package KEGGREST. See the documentation for some examples.

It looks like this one allows multiple gene queries:

library("KEGGREST")
keggLink("pathway", c("eco:b0002", "eco:b0003"))

#        eco:b0002       eco:b0002       eco:b0002       eco:b0002       eco:b0002 
# "path:eco00260" "path:eco00270" "path:eco00300" "path:eco01100" "path:eco01110" 
#       eco:b0002       eco:b0002       eco:b0003       eco:b0003       eco:b0003 
# "path:eco01120" "path:eco01230" "path:eco00260" "path:eco01100" "path:eco01120" 
#       eco:b0003 
# "path:eco01230"
ADD COMMENT

Login before adding your answer.

Traffic: 2303 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6