How To Find The Pathways In Which A Given Gene Or Protein Is Involved?
13.6 years ago
K-Li ▴ 80

As the title says, I have some gene names and want to know which pathways they are involved in. Which methods would work? Thanks.

gene pathway • 22k views
13.6 years ago

The first thing that comes to mind is searching WikiPathways: just enter your gene or protein in the search box. If you have to search for many genes, you could use the WikiPathways web services.

Since WikiPathways now contains Reactome, you will find the Reactome pathways this way as well, but of course you could also search Reactome separately. You could use PathVisio, which is not only the editor applet of WikiPathways but also a standalone pathway tool, to search the converted KEGG pathways, which you can download from the PathVisio site. That will not give better results than searching KEGG directly, as Michael suggested.

Also check Pathway Commons. It covers many pathway resources and has a nice search feature. Its content includes, among others, BioGRID, HumanCyc, MetaCyc, MINT, IntAct, the NCI/Nature Pathway Interaction Database and Reactome.

Finally, you might want to search for your gene in GO. Many gene classes in GO are actually pathways, or at least the genes in the class are clearly related to a specific biological pathway. That way you might find a few pathways that your gene belongs to, or is related to, even though the gene is not yet included in the pathway itself.

There are also a number of species-specific pathway resources. How useful these are depends, of course, on which species your genes are from.

Update: WikiPathways content is now also available as downloadable RDF and can be accessed through a SPARQL endpoint. Examples of useful SPARQL queries can be found here. These include queries to find all pathways containing a specific gene.

13.6 years ago
Stew ★ 1.4k

I would recommend DAVID for an easy way to go from gene lists to functional information, such as pathways. It contains lots of the databases mentioned by other people here and is very well documented and highly cited.


It is a good idea to check the date of DAVID's latest update before you use it. Currently, the latest update is from January 2010.

13.3 years ago
ff.cc.cc ★ 1.3k

DAVID and GSEA are my preferred online tools, but I had to do the same job inside a C/C++ program. I downloaded signature files from the Broad Institute: http://www.broadinstitute.org/gsea/downloads.jsp

Then I parsed and analyzed them with the following code:

// requires <fstream>, <set>, <sstream>, <string>, <vector> and `using namespace std;`
// input:
// geneIds:  set of genes to look for
// filename: gsea filename (tab-separated rows: name, description, then genes)
// cutoff:   min. number of genes to match
// pLimit:   max. p-value for a gene set to be reported
// output:
// genesetResult: matching genes per reported gene set, with the set name appended last
// genesetP:      p-value per reported gene set
// returns the number of gene sets with at least `cutoff` matching genes

    static inline int overlapGeneSet(const set<string> &geneIds, const string &filename, int cutoff, double pLimit, vector< vector<string> > &genesetResult, vector<double> &genesetP){
        int result(0);
        ifstream gsea(filename.c_str());
        string line;
        string strVal;
        const char delimiter = '\t';

        int gseaSize;                   // signature size
        string gseaName;                // signature name
        string gseaSource;              // signature desc.
        vector<string> gseaCommonGenes; // matching genes

        // foreach row/geneset; looping on getline() itself (not on eof())
        // avoids processing a phantom empty row at the end of the file
        while(getline(gsea, line)){
            istringstream strstream(line);
            gseaCommonGenes.clear();
            gseaSize = -2;
            gseaName = "";
            gseaSource = "";
            // foreach field in geneset: field 0 is the name, field 1 the
            // description, all remaining fields are gene ids
            while(getline(strstream, strVal, delimiter)){
                if(gseaSize == -2)
                    gseaName = strVal;
                else if(gseaSize == -1)
                    gseaSource = strVal;
                else if(geneIds.find(strVal) != geneIds.end())
                    gseaCommonGenes.push_back(strVal);
                gseaSize++;
            }
            if((int)gseaCommonGenes.size() >= cutoff){
                result++;
                double P = <your enrichment test here>; // placeholder: plug in your own test
                        // e.g. hypg(NGenes, gseaSize, geneIds.size(), gseaCommonGenes.size());
                if(P < pLimit){
                    gseaCommonGenes.push_back(gseaName);
                    genesetResult.push_back(gseaCommonGenes);
                    genesetP.push_back(P);
                }
            }
        }
        return result;
    }
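
The `hypg` routine named in the comment is not shown in this post. One plausible reading is an upper-tail hypergeometric test, so here is a minimal sketch under that assumption. The function body, the helper `logChoose`, and the parameter names are illustrative and not the original author's code; only the argument order follows the comment.

```cpp
#include <algorithm>
#include <cmath>

// Log of the binomial coefficient C(n, k), computed with lgamma so that
// large gene counts do not overflow.
static double logChoose(int n, int k) {
    return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
}

// Upper-tail hypergeometric p-value (assumed meaning of "hypg"):
// probability of seeing at least `overlap` gene-set members when
// `querySize` genes are drawn without replacement from a background of
// `nGenes` genes, of which `setSize` belong to the gene set.
double hypg(int nGenes, int setSize, int querySize, int overlap) {
    const int maxOverlap = std::min(setSize, querySize);
    // Smallest overlap that is possible at all given the background size.
    const int minOverlap = std::max(0, querySize - (nGenes - setSize));
    const int kStart = std::max(overlap, minOverlap);
    const double logDenom = logChoose(nGenes, querySize);

    double p = 0.0;
    for (int k = kStart; k <= maxOverlap; ++k) {
        p += std::exp(logChoose(setSize, k)
                      + logChoose(nGenes - setSize, querySize - k)
                      - logDenom);
    }
    return std::min(p, 1.0); // guard against rounding slightly above 1
}
```

With such a function available, the placeholder line could read `double P = hypg(NGenes, gseaSize, geneIds.size(), gseaCommonGenes.size());`, where `NGenes` is the size of whatever background gene universe you choose. That choice is yours to make and affects the p-values.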

Call your own routine for computing the enrichment score at the line where I assign a value to P. Look also at:


Hey ff.cc.cc, what signature files are you talking about? I want to use your identification method; could you tell me which archives you used? Thanks.


This question is 5 years old, and ff.cc.cc hasn't been on Biostars for 2 years and 3 months, so I wouldn't count on an answer here! ;)

13.4 years ago
Dataminer ★ 2.8k

GSEA from the Broad Institute.

9.3 years ago

I need to do this from the command line rather often, so I wrote a script to query KEGG called kg.

$ echo "Gna14" | kg -m 0 -q -s rno -d --noheader -
Gna14    04020    Calcium signaling pathway
Gna14    05142    Chagas disease (American trypanosomiasis)
Gna14    05146    Amoebiasis

You can also go the other way and get genes from pathway ids.

To explain the command line:

-m 0 # join on column 0 (there is only one gene name, hence only one column to join on)
-q # quiet, do not show progress info on stderr
-s rno # the gene identifiers are for Rattus norvegicus (use hsa for human and mmu for mouse)
-d # add definitions (the human-readable part in the third column)
--noheader # the input data does not contain a header
- # read the input from STDIN

One advantage of using kg is that kg stores the data locally so subsequent queries are instantaneous.

Install with

pip install kg

See https://github.com/endrebak/kg for more.

The command line interface:

kg

Get KEGG data from the command line.
(Visit github.com/endrebak/kg for examples and help.)

Usage:
    kg --help
    kg --mergecol=COL --species=SPEC [--genes] [--definitions] [--noheader] [--quiet] FILE
    kg --species=SPEC [--definitions] [--quiet]
    kg --removecache

Arguments:
    FILE                    infile to add KEGG data to (read STDIN with -)
    -s SPEC --species=SPEC  name of species (examples: hsa, mmu, rno...)
    -m COL --mergecol=COL   column (0-indexed int or name) containing gene names

Options:
    -h --help               show this message
    -q --quiet              do not show progress messages on stderr
    -n --noheader           the input data does not contain a header
    -d --definitions        add KEGG pathway definitions to the output
    -g --genes              get the genes related to KEGG pathways
                            (when used, mergecol COL should contain KEGG pathway
                            ids)
    --removecache           removes the local cache so that the KEGG REST DB is
                            accessed anew

Examples:

    Write all KEGG info to STDOUT for "Rattus Norvegicus":

        kg --species rno

    Get all human pathways associated with the genes in column called "Gene" in
    test.txt, merge them to the file, add pathway definitions and write to STDOUT

        kg -s hsa -m Gene -d test.txt