As the title says, I have some gene names and I want to know which pathways they are involved in. Which method would work? Thanks.
The first thing that comes to mind is searching WikiPathways. Just paste your gene or protein name into the search box. If you have many genes to look up, you could use the WikiPathways web services.
Since WikiPathways now contains Reactome, you will find the Reactome pathways this way as well, but of course you could search Reactome separately. You could also use PathVisio, which is not only the editor applet of WikiPathways but also a standalone pathway tool, to search the converted KEGG pathways that you can download from the PathVisio site. But that will not give better results than what Michael suggested for searching KEGG directly.
Also check Pathway Commons. They cover a lot of pathway resources and have a nice search feature. Their content includes, among others, BioGRID, HumanCyc, MetaCyc, MINT, IntAct, the NCI/Nature Pathway Interaction Database and Reactome.
Finally, you might want to search for your gene in GO. Many gene classes in GO actually are pathways, or at least the genes in the class are clearly related to a specific biological pathway. That way you might find a few pathways that your gene belongs to, or is related to, even though the gene is not yet included in the pathway itself.
There are also a number of species-specific pathway resources. How useful these are of course depends on which species your genes are from.
Update: WikiPathways content is now also available as downloadable RDF and can be accessed through a SPARQL endpoint. Examples of useful SPARQL queries can be found here. These include queries to find all pathways containing a specific gene.
DAVID and GSEA are my preferred online tools. But I had to do the same job inside a C/C++ program, so I downloaded signature files from the Broad Institute: http://www.broadinstitute.org/gsea/downloads.jsp
Then I parsed and analyzed with the following code:
#include <fstream>
#include <set>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

// input:
//   geneIds:  set of genes to look for
//   filename: GSEA filename (one gene set per row, tab-delimited: name, description, genes)
//   cutoff:   min. number of genes to match
//   pLimit:   significance threshold (only gene sets with P < pLimit are reported)
// output:
//   genesetResult: matching genes of each reported gene set, with the set name appended last
//   genesetP:      P value of each reported gene set
// returns: number of gene sets with at least `cutoff` matching genes
static inline int overlapGeneSet(const set<string> &geneIds, const string &filename, int cutoff, double pLimit, vector< vector<string> > &genesetResult, vector<double> &genesetP){
    int result(0);
    ifstream gsea(filename.c_str());
    string line;                    // current input row
    string strVal;                  // current field
    const char delimiter = '\t';
    int gseaSize;                   // signature size
    string gseaName;                // signature name
    string gseaSource;              // signature desc.
    vector<string> gseaCommonGenes; // matching genes
    // foreach row/geneset; testing getline() itself avoids processing a
    // phantom row at end of file, which a while(!eof()) loop would do
    while(getline(gsea, line)){
        istringstream strstream(line);
        gseaCommonGenes.clear();
        gseaSize = -2;
        gseaName = "";
        gseaSource = "";
        // foreach field in geneset: field 0 is the name, field 1 the
        // description, every later field a gene
        while(getline(strstream, strVal, delimiter)){
            if(gseaSize == -2)
                gseaName = strVal;
            else if(gseaSize == -1)
                gseaSource = strVal;
            else if(geneIds.find(strVal) != geneIds.end())
                gseaCommonGenes.push_back(strVal);
            gseaSize++;
        }
        if(static_cast<int>(gseaCommonGenes.size()) >= cutoff){
            result++;
            double P = <your enrichment test here>;
            // e.g. hypg(NGenes, gseaSize, geneIds.size(), gseaCommonGenes.size());
            if(P < pLimit){
                gseaCommonGenes.push_back(gseaName);
                genesetResult.push_back(gseaCommonGenes);
                genesetP.push_back(P);
            }
        }
    }
    return result;
}
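One possible way to fill in the enrichment test at the line that assigns P is an upper-tail hypergeometric probability, i.e. the chance of seeing at least the observed overlap if the query genes were drawn at random from the background. The following is only a minimal sketch and not the hypg routine mentioned in the comment above, which is not shown in this thread. The function name hypergeomUpperTail, its argument order, and the choice of background size are assumptions, and the code assumes setSize and querySize do not exceed nGenes.

```cpp
#include <algorithm>
#include <cmath>

// Natural log of the binomial coefficient C(n, k); lgamma avoids overflow
// for the large n typical of genome-wide backgrounds.
static double logChoose(int n, int k){
    return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
}

// Upper-tail hypergeometric probability P(X >= overlap).
//   nGenes:    number of genes in the background (assumption: chosen by the caller)
//   setSize:   number of genes in the gene set
//   querySize: number of genes in the query list
//   overlap:   number of genes shared by the query list and the gene set
static double hypergeomUpperTail(int nGenes, int setSize, int querySize, int overlap){
    // Range of overlaps that are possible at all for these sizes.
    const int maxOverlap = std::min(setSize, querySize);
    const int minOverlap = std::max(0, querySize - (nGenes - setSize));
    const int start = std::max(overlap, minOverlap);
    const double logDenom = logChoose(nGenes, querySize);
    double p = 0.0;
    // Sum the point probabilities from the observed overlap upwards.
    for(int k = start; k <= maxOverlap; ++k){
        p += std::exp(logChoose(setSize, k)
                      + logChoose(nGenes - setSize, querySize - k)
                      - logDenom);
    }
    return std::min(p, 1.0); // guard against rounding slightly above 1
}
```

Under these assumptions, the assignment could read `double P = hypergeomUpperTail(NGenes, gseaSize, (int)geneIds.size(), (int)gseaCommonGenes.size());`, where NGenes is a background size you have to supply yourself. Since many gene sets are tested, the resulting P values would normally still need a multiple-testing correction.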
Call your own routine for computing the enrichment score at the line where I assign a value to P. Look also at:
I need to do this from the command line rather often, so I wrote a script to query KEGG called kg.
$ echo "Gna14" | kg -m 0 -q -s rno -d --noheader -
Gna14 04020 Calcium signaling pathway
Gna14 05142 Chagas disease (American trypanosomiasis)
Gna14 05146 Amoebiasis
You can also go the other way and get genes from pathway ids.
To explain the command line:
-m 0         # join on column 0 (there is only one gene name, hence only one column to join on)
-s rno       # the gene identifiers are for Rattus norvegicus (use hsa for human and mmu for mouse)
-d           # add definitions (the human-readable part in the third column)
-q           # quiet, do not show progress info on stderr
--noheader   # the input data does not contain a header
-            # read the input from STDIN
One advantage of using kg is that kg stores the data locally so subsequent queries are instantaneous.
Install with
pip install kg
See https://github.com/endrebak/kg for more.
The command line interface:
kg
Get KEGG data from the command line.
(Visit github.com/endrebak/kg for examples and help.)

Usage:
    kg --help
    kg --mergecol=COL --species=SPEC [--genes] [--definitions] [--noheader] [--quiet] FILE
    kg --species=SPEC [--definitions] [--quiet]
    kg --removecache

Arguments:
    FILE                    infile to add KEGG data to (read STDIN with -)
    -s SPEC --species=SPEC  name of species (examples: hsa, mmu, rno...)
    -m COL --mergecol=COL   column (0-indexed int or name) containing gene names

Options:
    -h --help               show this message
    -q --quiet              do not show progress messages on stderr
    -n --noheader           the input data does not contain a header
    -d --definitions        add KEGG pathway definitions to the output
    -g --genes              get the genes related to KEGG pathways
                            (when used, mergecol COL should contain KEGG pathway ids)
    --removecache           removes the local cache so that the KEGG REST DB is accessed anew

Examples:
    Write all KEGG info to STDOUT for "Rattus Norvegicus":
        kg --species rno
    Get all human pathways associated with the genes in column called "Gene" in
    test.txt, merge them to the file, add pathway definitions and write to STDOUT
        kg -s hsa -m Gene -d test.txt
It is a good idea to check the date of DAVID's latest update before you use it. Currently the latest update is from January 2010.