STRING is a database of predicted protein-protein interactions at EMBL. It clusters results from many protein-protein interaction databases, such as MINT, and it also uses information from KEGG pathways and Reactome, to provide the best annotations for the interactions of a protein.
I am a bit confused by the results that I see there, because when I look at the genes in the pathway I am studying, I see many errors and annotations that I don't understand.
What is your experience with STRING? If you want to do me a favor, go there and look at the interactions annotated for a gene that you already know. Do you see anything weird?
I have used STRING in three projects and I am still using it for large-scale protein-protein interaction data analysis. I have downloaded the data and worked on PPI data from 5 eukaryotic model organisms. I strongly recommend STRING if you are looking for prokaryotic PPI data or if you are working on global-scale PPI network analysis in any given organism. An exceptional advantage of STRING is that, although it derives PPI information from multiple approaches, every single interaction is scored using a scoring scheme. This makes it easy to filter for the specific interactions you are interested in (for example, you can get human PPIs that have a score > 0.7 from the experimental approach) and thus reduce the false positive rate. Another interesting aspect of STRING is the predicted interactions that are not reported in DIP or HPRD (if you are looking for literature-curated, experimental annotations, I strongly recommend HPRD); this is something really exciting. You may get interesting connections (not yet proven, though) that can lead you to new biological insights. The STRING team also maintains an interesting blog, with new releases, code snippets, API details, etc.
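To make the score filtering concrete, here is a minimal sketch in Python. It assumes a space-separated table with one column per evidence channel and integer scores on a 0-1000 scale, so a cutoff of 0.7 corresponds to 700. The column names, the scale, and the protein IDs in the inline sample are assumptions for illustration, so check them against the header of the file you actually download.

```python
import csv
import io

# Inline sample standing in for a downloaded per-channel links file.
# Column names and the 0-1000 score scale are assumptions about the layout;
# the protein IDs are made up.
SAMPLE = """protein1 protein2 experimental database textmining combined_score
9606.ENSP0001 9606.ENSP0002 850 0 300 900
9606.ENSP0001 9606.ENSP0003 0 900 500 950
9606.ENSP0002 9606.ENSP0004 720 0 0 720
9606.ENSP0003 9606.ENSP0004 400 0 650 780
"""


def filter_by_channel(handle, channel="experimental", min_score=0.7):
    """Yield (protein1, protein2, score) for rows whose channel score exceeds min_score."""
    reader = csv.DictReader(handle, delimiter=" ")
    for row in reader:
        score = int(row[channel]) / 1000.0  # rescale to the 0-1 range
        if score > min_score:
            yield row["protein1"], row["protein2"], score


hits = list(filter_by_channel(io.StringIO(SAMPLE)))
for p1, p2, score in hits:
    print(p1, p2, score)
```

With a real file you would pass an open file handle instead of the `io.StringIO` sample, and change `channel` to whichever evidence column you care about.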
Have you looked at their web site or downloadable files? AFAIK, STRING basically uses Ensembl IDs in its PPI files but provides a separate mapping file to map from other identifiers. Mapping a gene to a pathway is not always straightforward. Think of this scenario: 1 gene, n transcripts, and only one of them may take part in the pathway. The n transcripts code for n splice variants of the same protein, so it is not wrong to merge the transcript IDs into one gene ID.
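As a rough illustration of what merging isoform-level IDs into gene IDs does to an interaction list, here is a small Python sketch. The mapping dictionary and all identifiers are invented; in practice the mapping would be loaded from whatever mapping file you use. It also reports IDs that could not be mapped, which is often where surprises come from.

```python
# Hypothetical mapping from protein/transcript-level IDs to gene IDs.
# These IDs are invented for illustration only.
PROTEIN_TO_GENE = {
    "ENSP_A1": "GENE_A",
    "ENSP_A2": "GENE_A",  # second splice variant of the same gene
    "ENSP_B1": "GENE_B",
    "ENSP_C1": "GENE_C",
}

protein_pairs = [
    ("ENSP_A1", "ENSP_B1"),
    ("ENSP_A2", "ENSP_B1"),  # same gene pair, reached via another isoform
    ("ENSP_B1", "ENSP_C1"),
    ("ENSP_A1", "ENSP_A2"),  # two isoforms of one gene: becomes a self-pair
    ("ENSP_X9", "ENSP_B1"),  # ID missing from the mapping
]


def collapse_to_genes(pairs, mapping):
    """Map each pair to gene level; return (gene_pairs, unmapped_ids)."""
    gene_pairs = set()
    unmapped = set()
    for a, b in pairs:
        missing = [x for x in (a, b) if x not in mapping]
        if missing:
            unmapped.update(missing)
            continue
        gene_a, gene_b = mapping[a], mapping[b]
        if gene_a == gene_b:
            continue  # drop self-pairs created by merging isoforms
        gene_pairs.add(tuple(sorted((gene_a, gene_b))))
    return gene_pairs, unmapped


gene_pairs, unmapped = collapse_to_genes(protein_pairs, PROTEIN_TO_GENE)
print(sorted(gene_pairs))
print(sorted(unmapped))
```

Note that the merge loses the information about which isoform actually carried the interaction, which is exactly the trade-off described above.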
I looked at the genes in the pathway that I studied and found a lot of errors, including genes with similar names being merged into one, and many false positives due to genes being in the same pathway in some database. And my pathway is not exactly badly annotated: it was already described in the '80s...
I've been using STRING extensively, but not for protein-protein interactions work. STRING, as you note, is a bit of a mutt in terms of the different data sources it mines. Some that you're missing include a broad literature-based search, as well as gene expression data sets. So if you're interested primarily in physical interactions or any other single type of data source, STRING is a poor choice for your work. On the other hand, STRING does provide confidence scores for each association, as well as annotation for their data source types (with the license). So you can use those to filter out the interactions derived from data types you don't want to see.
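Filtering by data source type could look something like the Python sketch below. It assumes you already have per-channel scores on a 0-1 scale for each pair; the channel names, scores, and the 0.4 cutoff are illustrative choices, not values taken from STRING.

```python
# Per-channel scores on a 0-1 scale; channel names and values are illustrative.
interactions = {
    ("P1", "P2"): {"experimental": 0.6, "coexpression": 0.0, "textmining": 0.8},
    ("P1", "P3"): {"experimental": 0.0, "coexpression": 0.0, "textmining": 0.9},
    ("P2", "P3"): {"experimental": 0.0, "coexpression": 0.5, "textmining": 0.0},
}


def keep_supported_by(interactions, wanted, min_score=0.4):
    """Keep pairs with at least one wanted channel scoring >= min_score.

    Returns only the wanted channels that passed, so evidence from
    unwanted data types is dropped from the result.
    """
    kept = {}
    for pair, scores in interactions.items():
        passing = {
            channel: score
            for channel, score in scores.items()
            if channel in wanted and score >= min_score
        }
        if passing:
            kept[pair] = passing
    return kept


physical_like = keep_supported_by(interactions, wanted={"experimental"})
print(physical_like)
```

Here the pair supported only by text mining disappears entirely, while the pair with both experimental and text-mining evidence is kept on the strength of its experimental score alone.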
I have not used STRING in particular but I have worked with protein interactions before (DIP dataset). I recall that even experimentally produced protein-protein interactions may have very large false positive rates (as for false negatives, who knows?). Some papers claim that up to 50% of the interactions were spurious, and repeated experiments showed very small overlaps. Predictions may be even less reliable.
At the same time the DIP dataset performed substantially better if we only considered the interactions for which there were multiple sources of evidence, so that may be a strategy to consider in your case as well.
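A minimal Python sketch of that multiple-evidence strategy, assuming the data can be reduced to (protein, protein, source) records. The protein names and source labels are invented, and the threshold of two distinct sources is just an example.

```python
from collections import defaultdict

# (protein_a, protein_b, source) records; all names are invented.
records = [
    ("P1", "P2", "yeast_two_hybrid"),
    ("P2", "P1", "affinity_purification"),  # same pair, reversed order
    ("P1", "P3", "yeast_two_hybrid"),
    ("P2", "P3", "yeast_two_hybrid"),
    ("P2", "P3", "yeast_two_hybrid"),  # repeat of the same source, counted once
    ("P3", "P4", "affinity_purification"),
    ("P4", "P3", "literature"),
]


def multi_evidence(records, min_sources=2):
    """Return pairs backed by at least min_sources distinct sources."""
    sources = defaultdict(set)
    for a, b, source in records:
        # frozenset makes (a, b) and (b, a) count as the same interaction
        sources[frozenset((a, b))].add(source)
    return {
        tuple(sorted(pair)): sorted(found)
        for pair, found in sources.items()
        if len(found) >= min_sources
    }


supported = multi_evidence(records)
for pair, found in sorted(supported.items()):
    print(pair, found)
```

Counting distinct sources rather than raw records matters here: the same experiment reported twice should not look like independent confirmation.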