I have absolutely no idea whether the metric you're asking for is really meaningful, but it's easy to get on a pathway module level.
Solution for R with the use of tidyverse and KEGGREST package.
kegg_id_names variable at the beginning of the code should be the vector of kegg id's for your genes.
kegg_positions_by_module is a tibble containing relative position of a gene in each module that the gene is listed in (in kegg). Such a position for a given pathway is calculated from the list of genes in module description (like here in reaction section: https://www.genome.jp/kegg-bin/show_module?M00001 ). Genes on the same "level" share the same score.
library(tidyverse)
library(KEGGREST)
kegg_id_names = c('R01786', 'R01063')
path_ids = map(kegg_id_names, ~keggLink('module', .)) %>%
unlist() %>%
unique()
path_data = keggGet(path_ids)
get_kegg_position = function(path_record) {
gene_names = names(path_record$REACTION)
if (is.null(gene_names)) {
res = tibble(gene = NA,
position = NA)
}else{
gene_posns = seq(1, 0, length.out = length(gene_names))
res = tibble(gene = gene_names,
position = gene_posns) %>%
separate_rows('gene', sep = ',')
}
res %>% mutate(path_id = path_record$ENTRY,
path_name = path_record$NAME)
}
kegg_positions_by_module = map_df(path_data,
get_kegg_position) %>%
filter(!is.na(position))
You can get position in whole pathway replacing kegglink part by "~keggLink('pathway', .)", but I'think it can be wrong since whole pathway is more "intertwined".
Thank you for your answer, I managed to adapt it to Pathways instead of Modules :
Here I obtain a value of only 0.433 for TGFB1 whereas I should get value closer to 1. For SRF gene, I obtain value of 0.481 which is higher than TGFB1 while SRF is upstream of only 1 gene and TGFB1 is upstream of many more genes. (See MAPK pathway diagram)
I think your approach is based on the hypothesis that KEGG ranks genes from more upstream to more downstream. While this might be true for reactions, I think it is not for Pathways. I think the order of genes in KEGG pathway is from Left to Right and Top to Bottom and in the KEGG diagrams.
I am sure this score would never be accurately quantitative but I would hope it gives an idea about how often a gene is upstream of pathways.
I think that this complexity on a pathway level may be managed using some sort of graphs, so perhaps KEGGgraph package may provide some solution. But I haven't work with it. If you'll find a solution, please share it.
Hello Alex,
I used KEGGgraph as suggested to get Directed Graphs from KEGG pathway database for each gene in my gene list. Here is a solution, the code is probably suboptimal but I hope it is understandable.
The first fonction GetKEGGigraph takes in a KEGG pathway id (e.g. "04010" for MAPK signalling pathway). It first download the pathway as Directed Graph using KEGGgraph, then transform it into an "igraph graph" object. Each gene is a node and each relation geneA -> geneB is a directed edge. We first remove isolated nodes & self loops from the graphs. Then, as some pathways are Cyclic, we need to make them acyclic using Minimum Spanning Tree algorithm. This is because we cannot make a proper gene hierarchy in cyclic graphs. The GetKEGGigraph function returns the Minimum Spanning Tree.
The second function upstream_score_KEGG takes as input a gene list as gene names (e.g. c("TGFB1","MAX")). It first retrieve for each gene all the pathways containing the gene. Then, for each gene in each pathway, it calls GetKEGGigraph to get the Minimum Spanning Tree. The tree is sorted using 'Topological Sorting' algorithm (From wikipedia :"In computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering"). This way, most upstream genes will be placed before more downstream genes. The function upstream_score_KEGG outputs a data.frame with the upstream score (1 - relative position) of the gene in each of it's pathway, or NA if no pathway is found.
I have very little experience with graph theory so if I would gladly take advices on this part, particularly : it this the best way to get an 'order' of genes in a graph ? I think that with topological sorting, two 'unrelated nodes' will be placed in a kind of arbitrary way.
Here is the code:
So the results is, for MAPK signalling pathway, TGFB1 has score of 0.731 & MAX has score of 0.187. This gives you an idea of how upstream the gene is in a pathway. In the 'TGF-beta signaling pathway' for instance the score for TGFB1 is only 0.31 because it is downstream of 5 genes in this pathway (TGF-beta signaling pathway).
The trouble comes when you average over all the pathways, because as you said pathways are complex and the end of any pathways usually comes before and after any pathway.... So the same gene might be the most downstream in one pathway and the most upstream in another pathway...
I hope the approach is not too complex for nothing and I would gladly hear any comments on better way to achieve the same results.
Thanks for sharing the solution! Unfortunately, I can not provide you any further guidance(