Ok so I have been lurking for days to find a way to analyse my data, often ending up here... but I haven't found a satisfying answer yet. Lots of the answers popping up are 5+ years old. I am still part of the newbie gang. So here I come with a bunch of questions.
I have a few sets of ~100 proteins identified by MS as up/down-regulated under a few conditions. So I am looking for an easy workflow to obtain a scientifically accurate network, stringent enough to stay away from false positives but not TOO stringent, so I can actually get some information from my diverse network. As well as a pretty figure at the end to show off with :) I have UniProt IDs, a fold change and a p-value. I'd prefer free apps (OS X) or web-based tools.
My idea is to end up in Cytoscape to finalise some neat protein networks.
Based on direct physical interactions > common pathway/function > genetic interactions (it seems to dilute the essential information?). Rather experimental than predicted.
With a color code for up/down-regulated, scaled according to my fold change.
Maybe a shape code for molecular function/protein class.
A node size scaled to the number of edges (already mastering that one)
With also a clear identification of functional groups (biological process/pathway).
And several types of connections according to the source.
To give you an example of what I have in mind: http://www.nature.com/nbt/journal/v30/n10/full/nbt.2356.html (They use MetaCore which I don't have access to...)
First of all, UniProt IDs... Somehow there are always some proteins that get left behind because the IDs are not accurately recognised. I have checked and corrected them where necessary, and my (human) IDs are up to date on Uniprot.org. How come? Do I just have to live with that?
Ingenuity, iPathway, Reactome don't seem to answer my needs.
Best tool to build my network:
STRING, I was quite happy with it at first but it smells like a trap. How to avoid false positives? Sources? Minimum score (0.7?), max number of interactors to show (1st, 2nd shell?)
GeneMANIA is quite appealing, all pretty and simple. I can actually select the kind of sources I am looking for... but way more hits than STRING. Lacking a score threshold? No control over it..
ConsensusPathDB. From what I've read, it seems to be THE ONE but it crashes with over 9000 listed interactions on my computer, and from what I've seen it's pretty basic?
Cytoscape apps: STRING, GeneMANIA, Reactome... none seems to work as well as the web-based tools.
Cytoscape:
How do I import any external data (fold change, GO stuff, ...) into my network coming from STRING or GeneMANIA, let's say? I have a file for the network and a file with FC values etc...
I've read about GOlorize, BiNGO, iRegulon, ClusterViz, EnrichmentMap... where to go?
I don't feel like I am asking for the moon. But we're in 2017 and I can't be the first one with that kind of request. How come it is so difficult, diluted among so many options, with no easy way to go? I don't mind "getting my hands dirty" but it seems endless here...
So, any solid workflow to follow?
First, regarding the ID problem, you should be aware that not all resources are synchronized or even use the same reference genome annotations. Different resources annotate genomes differently, and annotations change over time. So the first thing to do is decide which reference genome you're going to use and stick with it. Any gene/protein not in that reference doesn't exist for your purpose, i.e. if an ID doesn't map to something in the reference you can ignore it. The second step is to understand the data in the different databases and how it has been derived. Then you need to figure out what it is you want to do/show in relation to the biology you're interested in. Should the edges of the graph represent documented physical interactions or, more generically, functional relationships? Finally, if you want to use Cytoscape, start by reading the documentation. It will give you ideas on how to import data.
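To make the "pick one reference and stick with it" step concrete, here is a minimal sketch of how one might check a list of IDs against a chosen reference before sending it to any tool. The function names and the example reference set are hypothetical. The regular expression is the accession pattern published in UniProt's help pages, and dropping the "-N" isoform suffix is an assumption that may or may not suit your data.

```python
import re

# UniProt accession pattern (from the UniProt help pages), optionally
# followed by an isoform suffix such as "-2".
ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})(-\d+)?$"
)


def normalize(raw_id):
    """Trim whitespace, uppercase, and drop an isoform suffix like '-2'.

    Returns None when the string does not look like a UniProt accession.
    """
    cleaned = raw_id.strip().upper()
    if ACCESSION.match(cleaned) is None:
        return None
    return cleaned.split("-")[0]


def split_by_reference(query_ids, reference_ids):
    """Separate IDs found in the chosen reference from those that are not."""
    mapped, unmapped = [], []
    for raw in query_ids:
        accession = normalize(raw)
        if accession is not None and accession in reference_ids:
            mapped.append(accession)
        else:
            unmapped.append(raw)  # keep the original string for inspection
    return mapped, unmapped


# Hypothetical reference: in practice, the accessions of whichever
# proteome release you decided to stick with.
reference = {"P04637", "Q9Y6K9"}
query = [" p04637 ", "Q9Y6K9-2", "TP53_HUMAN", "P38398"]

mapped, unmapped = split_by_reference(query, reference)
print("mapped:", mapped)
print("unmapped:", unmapped)
```

Anything that ends up in the unmapped list is either malformed (an entry name instead of an accession, for instance) or simply absent from the reference, and per the advice above it can be set aside.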
Thank you for your answer!
First. That's what I was afraid of. I have many more misses with GeneMANIA than with STRING. So one point for STRING.
Second. Your comment is quite pertinent, as the sources of STRING are not as clear to me as the GeneMANIA ones. One point for GeneMANIA. I would like 2 types of edges: bold and shiny physical interactions on one side, thin and faded functional relationships on the other.
Finally, I'm reading and learning... there's such a huge amount of documentation... but got it.
Regarding STRING, I would pick the following parameters:
Sources: text-mining, experiments, databases, co-expression
Minimum score: 0.7
1st shell: query proteins only
2nd shell: 5-10 proteins... and kick out some if similar proteins sit in the same close network (e.g. polymerase subunits)
Any feedback? What do you think about the clustering functions (k-means, MCL)?
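For what it's worth, the 0.7 threshold and the two edge types asked for above could be applied to an exported edge table roughly like this. This is only a sketch: the column names, the protein names and every score in the sample are made up for illustration and should be adapted to whatever your export actually contains. Also note that a high experimental score is used here as a stand-in for "physical" evidence, which is an approximation rather than a guarantee.

```python
import csv
import io

# Made-up, tab-separated edge table. Column names and scores are
# illustrative assumptions, not real STRING output.
SAMPLE = """node1\tnode2\texperimental\tcombined_score
PROT_A\tPROT_B\t0.95\t0.999
PROT_A\tPROT_C\t0.60\t0.990
PROT_D\tPROT_E\t0.10\t0.750
PROT_F\tPROT_G\t0.05\t0.450
"""


def classify_edges(handle, min_combined=0.7, min_experimental=0.4):
    """Keep edges at or above min_combined and label their evidence type."""
    edges = []
    for row in csv.DictReader(handle, delimiter="\t"):
        combined = float(row["combined_score"])
        if combined < min_combined:
            continue  # below the overall confidence threshold, drop it
        if float(row["experimental"]) >= min_experimental:
            kind = "experimental"  # candidate for the bold edge style
        else:
            kind = "functional"  # candidate for the thin, faded edge style
        edges.append((row["node1"], row["node2"], kind, combined))
    return edges


edges = classify_edges(io.StringIO(SAMPLE))
for node1, node2, kind, score in edges:
    print(node1, node2, kind, score)
```

The resulting "kind" value can be kept as an edge attribute, so the two classes can later be styled differently in the network viewer.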
Please do not create answers to reply to comments; use the 'add reply' button for this. This keeps the discussion organized.
Depending on what you want the edges of the graph to mean and on what you want to do with the graph, you should select different things. Also, the functionally relevant information is concentrated in different data types for different organisms. For example, for human, most of the information on biological processes is in the physical protein interaction graph, whereas for some model organisms it is in the genetic interactions. In my experience, co-expression data is useless for inferring gene function. The clustering algorithm to choose also depends on your data representation, e.g. standard k-means operates on vectors, not on graphs. Given some query proteins, extracting a relevant subgraph from the whole graph is not an easy problem, but I have found that the limited k-walk approach works well in many cases (I have made it available in my Graph.pm Perl module).
If you decide to use STRING for this purpose, which should depend on what kind of network you are interested in, take a look at the new STRING app for Cytoscape. It makes importing a STRING network for a proteomics dataset into Cytoscape much less painful.
I have tried the app but it doesn't work as easily as the web app for me. Also, having a file before I import the network in Cytoscape allows me to add some parameters!
There is always a tradeoff between "easy" and "features". Sure, the web interface of STRING is the easiest way to access STRING. However, part of making it so easy is to leave out features. So if you need more features, e.g. the ability to map your own data onto a network, you have to use an interface that is not quite as easy. I'm not sure what you mean by "add some parameters" - if it is external parameters about the proteins, the normal workflow would be to use the "import table" functionality of Cytoscape.
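As a rough illustration of what such a table could look like, here is a small sketch that turns (ID, fold change, p-value) rows into a CSV keyed on the protein ID. The column names, the placeholder IDs, the 0.05 cutoff and the assumption that the fold change is a plain ratio (not already log-transformed) are all mine, not something prescribed by Cytoscape or STRING.

```python
import csv
import io
import math


def build_node_table(rows, alpha=0.05):
    """Return CSV text with one line per protein.

    rows: iterable of (protein_id, fold_change, p_value), where
    fold_change is assumed to be a positive ratio (treated / control).
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["uniprot_id", "log2_fc", "p_value", "direction"])
    for protein_id, fold_change, p_value in rows:
        log2_fc = math.log2(fold_change)
        if p_value < alpha and log2_fc > 0:
            direction = "up"
        elif p_value < alpha and log2_fc < 0:
            direction = "down"
        else:
            direction = "ns"  # not significant, or no change at all
        writer.writerow([protein_id, f"{log2_fc:.3f}", p_value, direction])
    return out.getvalue()


# Placeholder IDs and values, for illustration only.
table = build_node_table([
    ("ID_A", 4.0, 0.001),
    ("ID_B", 0.5, 0.01),
    ("ID_C", 1.2, 0.3),
])
print(table)
```

Once saved to a file, a table like this can be loaded with the "import table" functionality, provided the ID column matches a column already present in the network's node table. The log2_fc column can then drive a continuous colour mapping and the direction column a discrete one.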
I don't know if someone already recommended this, but I'm in the same boat. I can't even properly install the programs I want... but there's a pretty awesome and easy way to make an SSN:
The "EFI - Enzyme Similarity Tool" http://efi.igb.illinois.edu/efi-est/
This is a web tool that allows you to either upload your own FASTA sequences or retrieve them by UniProt IDs or NCBI IDs, and it offers a bunch of other options for your edge and node attributes. You can even make a Genome Neighbourhood Network, for enzyme pathways or whatever. You can then modify your edge thresholds to your liking, along with some other options to make your network look good. Hope this helps someone!