I'm trying out a few methods for enrichment analysis of GWAS data. For this kind of analysis I need a list of pathways (or GO categories, gene families, etc.), where each line gives a gene grouping followed by all the genes in that grouping, for example:
path1 gene1 gene2
path2 gene3 gene4 gene5 gene6
path3 gene7 gene8 gene9
I downloaded the entire hsa-pathway folder from the KEGG FTP site before it closed down, and I thought I had all this information in a file called hsa_gene_map.tab that looks like this:
2793 04062 04724 04725
2796 04912
2797 04912
2798 04080 04912
2799 00531 01100 04142
But it turns out this is the opposite of what I need: the first number on each line is the Entrez gene ID, and the remaining fields are KEGG pathway IDs, i.e.
gene1 path1 path2
gene2 path3 path4 path5 path6
gene3 path7 path8 path9
For KEGG, how can I get a list of pathways and all the genes in each of those pathways? I don't have a KEGG subscription here, so I can't use the FTP site any more. A file in a simple format like the first example would be the most useful, since I have to read each of these lines into a list in R to do the analysis.
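Since hsa_gene_map.tab already holds every gene-to-pathway assignment, one option is to invert it rather than download anything new. Below is a minimal sketch in Python, assuming each line is a gene ID followed by whitespace-separated pathway IDs, as in the excerpt above. The function names `invert_gene_map` and `format_gene_sets` are made up for illustration, and the sample lines are the five shown earlier.

```python
from collections import defaultdict


def invert_gene_map(lines):
    """Turn 'gene path path ...' lines into {pathway_id: [gene_id, ...]}.

    Assumes whitespace-separated fields with the gene ID first
    (the layout shown in the hsa_gene_map.tab excerpt).
    """
    pathways = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and genes with no pathway listed
        gene, path_ids = fields[0], fields[1:]
        for path_id in path_ids:
            if gene not in pathways[path_id]:  # avoid duplicate genes
                pathways[path_id].append(gene)
    return dict(pathways)


def format_gene_sets(pathways, sep=" "):
    """Return one 'path gene gene ...' string per pathway, sorted by pathway ID."""
    return [sep.join([path_id] + genes)
            for path_id, genes in sorted(pathways.items())]


# The five sample lines from the question.
sample = [
    "2793 04062 04724 04725",
    "2796 04912",
    "2797 04912",
    "2798 04080 04912",
    "2799 00531 01100 04142",
]

gene_sets = invert_gene_map(sample)
rows = format_gene_sets(gene_sets)
for row in rows:
    print(row)
```

To run it on the real file, pass an open file handle instead of `sample`, e.g. `with open("hsa_gene_map.tab") as fh: gene_sets = invert_gene_map(fh)`, and write the rows out one per line. The resulting file has the "path gene gene ..." layout of the first example, so it should be readable in R line by line (for instance with `readLines` and `strsplit`). The pathway IDs stay as bare numbers here; mapping them to pathway names would need a separate lookup that this sketch does not cover.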
Does it have to be KEGG? I think you should be able to get a list of genes by GO term from Ensembl. That is actually something I need to do in future analyses :-)
I would also like to do this for other gene groupings, e.g. Gene Ontology, Reactome pathways, etc. The procedures described in the first link above use GO by default, but I'd like to start with KEGG if possible.