I am attempting to create a script/pipeline which will conduct a form of pathway analysis. The overall idea involves automating the extraction of all of the genes in a given biological pathway from any one of a few big online resources (KEGG, Reactome and WikiPathways). This is fairly challenging on its own, as each resource is curated differently, so the genes in one resource will differ slightly from those in the others.
The real challenge, however, is the next stage of the pipeline which would involve defining a specific "end point" of the pathway. While obviously this isn't exactly the way things work, as I don't think a great many pathways will have a specific point where they just stop, the idea is still that there will be some final gene/protein (or even metabolite) which is the "end" of that pathway.
Doing this manually is easy enough, of course: you just look at the online maps to see which genes are at the ends of the pathways. Automating it, however, currently seems impossible. Specifically, with the resources I mentioned, when you extract genes from a pathway you lose all sense of their order and hierarchy.
Ideally, in my mind, I'd see it being an ordered list of genes/transcripts with genes further down the pathway lower on the list.
Does anyone know of any method of doing this? Any online resource or package (either for R or maybe Bash/Python) which might allow the ordering of genes from a pathway?
Sorry if it wasn't clear, I'll try to clarify.
So, extracting the genes for any given pathway isn't difficult; there are even several R packages which allow one to extract a list of genes from a pathway ID.
What I need specifically is a way of defining the "end point" of the pathway, so you've rather hit the nail on the head with your own question. Using the example of MAPK signalling from KEGG you have a series of reactions happening with genes/proteins interacting with other genes/proteins downstream (so TNF interacts with TNFR interacts with TRADD etc etc down to p53 which seems to be a defined "end point"). What I want to be able to do is to list the genes in the order of the interactions (so TNF would be high on the list, while p53 would be low).
Naturally there are a number of issues I can see with this already such as concurrent or parallel interactions, but there would still be some higher in the list and some lower. At least this is how I imagine it. Does that clarify what I meant?
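To make the ordering idea above concrete, here is a minimal Python sketch. It assumes the interactions have already been extracted as (upstream, downstream) pairs. The function name `rank_by_depth`, the `intermediates` and `branchB` nodes and the edge list itself are made up for illustration and are not real KEGG relations. Each gene is ranked by the fewest interaction steps from any gene that has nothing upstream of it, so parallel branches simply share a depth.

```python
from collections import defaultdict, deque

def rank_by_depth(edges):
    """Map each node to the fewest steps from any source (a node with no upstream partner)."""
    downstream = defaultdict(set)
    nodes, has_upstream = set(), set()
    for src, dst in edges:
        downstream[src].add(dst)
        nodes.update((src, dst))
        has_upstream.add(dst)

    sources = sorted(nodes - has_upstream)
    depth = {node: 0 for node in sources}
    queue = deque(sources)
    # Breadth-first search: a node is assigned a depth only once, so loops cannot recurse forever.
    while queue:
        node = queue.popleft()
        for nxt in sorted(downstream[node]):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth

# Illustrative edges only: "intermediates" stands in for the steps skipped with "etc etc",
# and "branchB" is a hypothetical parallel branch.
toy_edges = [
    ("TNF", "TNFR"),
    ("TNFR", "TRADD"),
    ("TRADD", "intermediates"),
    ("intermediates", "p53"),
    ("TNFR", "branchB"),
]
depth = rank_by_depth(toy_edges)
ordered = sorted(depth, key=lambda gene: (depth[gene], gene))
print(ordered)
```

Nodes that sit only inside a loop with no source feeding into it never receive a depth in this sketch, so they would need separate handling.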
In your example there are MANY endpoints to that pathway, not just p53. As Kevin pointed out, there is a relation tag you can use. I want to add that if you also consider steps into different pathways (e.g. apoptosis) as end points, then the XML doc contains
type="roundrectangle"
which is a cheeky way of detecting these "end points". It's not comprehensive though. Maybe if you flesh out your idea we could help further. Why do you care about these end points in pathways?
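The `type="roundrectangle"` check could be sketched roughly as follows in Python, using only the standard library. This assumes that the attribute sits on a `graphics` element nested inside each `entry`. The XML string is a hand-written toy fragment, not real KEGG data, and `linked_pathway_nodes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment; IDs and names are placeholders, not real KEGG entries.
KGML_SAMPLE = """<pathway name="path:hsa00000" title="Toy pathway">
  <entry id="1" name="hsa:0001" type="gene">
    <graphics name="GENE_A" type="rectangle"/>
  </entry>
  <entry id="2" name="path:hsa00001" type="map">
    <graphics name="Toy downstream pathway" type="roundrectangle"/>
  </entry>
</pathway>"""

def linked_pathway_nodes(kgml_text):
    """Return (entry id, entry name, label) for entries drawn as a roundrectangle."""
    root = ET.fromstring(kgml_text)
    hits = []
    for entry in root.iter("entry"):
        graphics = entry.find("graphics")
        if graphics is not None and graphics.get("type") == "roundrectangle":
            hits.append((entry.get("id"), entry.get("name"), graphics.get("name")))
    return hits

print(linked_pathway_nodes(KGML_SAMPLE))
```

For a real pathway you would pass in the contents of the downloaded XML file instead of `KGML_SAMPLE`. As said above, this only catches hand-offs to other pathways, so it is not a comprehensive end-point detector.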
So the issue of multiple "end-points" is definitely one of the things I am trying to overcome. I simply mean it in the sense of trying to define where one pathway ends, so even if a pathway has multiple separate end points, I want to define any of them in a methodical way. So your example of
type="roundrectangle"
might be exactly the sort of thing I'm looking for.
Very loosely, I am trying to see how well expression from transcriptomics in separate pathways predicts the expression of "end points" in the same data. And I want to try to apply this to as many pathways as possible (even all of those available). So I have the data available; I just need to somehow computationally select an "end point". If that makes sense?
Hey, yes, I now see what you mean. I'm not sure that that information is recorded anywhere, though. In some pathways, there would obviously be some looping back and forth based on checkpoint signals and buffer controls, and some proteins / compounds would be important through the entire pathway.
In the XML file available through your link, there is a
<relation
tag that may contain useful information - I'm not sure. Other databases, such as STRINGdb, may contain more useful information on 'flow' through these pathways.
Actually, a couple of hours after asking this question I did finally stumble across the XML file and saw the
relation
tag you mention. And this did seem like it could be a potential solution, in that I could find the "end points" based on which relation tag had the highest numbers compared to everything else. It might take some parsing, but it seemed to be at least one computational method of doing this.
STRINGdb seems like it might be useful in a similar way, if I could somehow select only the genes from a certain KEGG/Reactome pathway from the STRINGdb set. Thanks for the suggestion!
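One possible way to get candidate "end points" from the relation tags, as a rough Python sketch. Note that this uses a different rule from the "highest numbers" idea above: an entry is treated as terminal if it appears as a target of some relation but never as the origin of one. It assumes each `relation` element has `entry1` (upstream) and `entry2` (downstream) attributes that refer to `entry` IDs. The XML is a hand-written toy fragment and `terminal_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment; IDs and names are placeholders, not real KEGG entries.
KGML_SAMPLE = """<pathway name="path:hsa00000">
  <entry id="1" name="hsa:0001" type="gene"/>
  <entry id="2" name="hsa:0002" type="gene"/>
  <entry id="3" name="hsa:0003" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel"/>
  <relation entry1="2" entry2="3" type="PPrel"/>
</pathway>"""

def terminal_entries(kgml_text):
    """Return names of entries that are targeted by a relation but never start one."""
    root = ET.fromstring(kgml_text)
    names = {entry.get("id"): entry.get("name") for entry in root.iter("entry")}
    origins, targets = set(), set()
    for relation in root.iter("relation"):
        origins.add(relation.get("entry1"))
        targets.add(relation.get("entry2"))
    sinks = targets - origins
    # Fall back to the raw ID if a relation refers to an entry that is not listed.
    return sorted(names.get(entry_id, entry_id) for entry_id in sinks)

print(terminal_entries(KGML_SAMPLE))
```

Entries caught in feedback loops will never show up as terminal with this rule, and a pathway can return several names, which fits the point made earlier about multiple end points.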