String - Number Of Intermediate Interactors Between Proteins
4
2
Entering edit mode
10.8 years ago
pfbusson ▴ 20

Hello,

Using STRING-DB, I need to build a data frame containing for each pair of proteins the number of interactors between them...

For example, if protein A and protein B have a common line in the node_node_links table, they are in direct interaction, and the number of intermediates will be 0. If protein A and protein D are not in direct interaction, but protein A interacts with protein C which in turn interacts with protein D, the number of intermediates between A and D will be 1. And so on, for the 20770 proteins...

A B 0
A D 1
...

I wonder if any of you ever faced this problem and would be able to give me a time and memory efficient solution.

Thanks !

interaction protein network • 5.7k views
ADD COMMENT
1
Entering edit mode

"data frame" suggests you want a solution in R?

ADD REPLY
0
Entering edit mode

yes Neil, I know him: that's what pfbusson wants.

ADD REPLY
1
Entering edit mode
10.8 years ago
B. Arman Aksoy ★ 1.2k

If I were to do it now, I would pick either of the following two methods:

1) Calculate the distance for each node pair using a graph library

This approach requires you to convert the data into a graph model for your library of choice. I previously did a similar thing with Jung, which is Java-based, and was really happy with it. It is hard to estimate the required memory, but I was able to calculate the length of shortest path for all possible node pairs on a PPI network I obtained from GeneMania (human) in a few hours on a standard laptop -- so it was not that bad.

If you want to see an example, you can check out this repository for an old project of mine and you will find my Java implementation that utilizes Jung (the documentation is a little bit scarce, but the code is simple):

https://bitbucket.org/armish/pathwayguy

and here is the code that calculates the pairwise shortest-path distances:

AbstractPairwiseDistanceMethod.java

I am guessing you can also go with the R-library igraph to do this calculation; I never tried it, but from the things I heard from my colleagues, I can tell it is a good alternative to Jung (especially if you are an R-person).

2) Work with adjancecy matrix

This requires you to convert the interaction file into an adjacency matrix (all nodes vs all nodes) and then multiplying it by itself, i.e. taking powers. I believe this is also not that memory intensive, especially if you go with a matrix multiplication method that is optimized for sparse matrices (R and Matlab both have many of those). For this alternative, the idea goes like this:

  • You create the adjancency matrix and every non-zero value in a cell indicates a distance of 0 for the corresponding node pair (A^1)
  • You multiply the matrix by itself and now every non-zero value in a cell indicates a distance of 1 for the corresponding node pair (A^2)
  • You multiply the matrix by itself and now every non-zero value in a cell indicates a distance of 2 for the corresponding node pair (A^3)
  • You multiply the matrix by itself and now every non-zero value in a cell indicates a distance of 3 for the corresponding node pair (A^4) ...

It is not a great approach, but surprisingly people who are familiar with matrix-based operations in either R or Matlab find this approach relatively easier to implement.

ADD COMMENT
0
Entering edit mode

+1 for suggesting igraph

ADD REPLY
1
Entering edit mode
10.8 years ago
Neilfws 49k

I don't have the node_node_links table, but let's assume that you can obtain the partners as a tab-delimited text file like this:

A    B
A    C
B    C
C    D

Then a graph can be created using R igraph:

library(igraph)
d <- read.table("myfile.txt", sep = "\t", header = F, stringsAsFactors = T)
g <- graph.data.frame(d)

A list of all igraph methods is here. Something like what you want can be achieved using shortest.paths():

shortest.paths(g)
  A B C D
A 0 1 1 2
B 1 0 1 2
C 1 1 0 1
D 2 2 1 0

I don't know how long this would take using the STRING database; it could potentially be a very long time.

ADD COMMENT
1
Entering edit mode
ADD REPLY
2
Entering edit mode

OK; just ran the above code using that data on a work server (24x X5680 Xeon CPU, 100 GB RAM); took about 15-20 minutes.

ADD REPLY
0
Entering edit mode

This is exactly what I was looking for. Thanks a lot !

ADD REPLY
0
Entering edit mode
10.8 years ago

prior warning: pfbusson works in my lab.

I've implemented a java idea: https://github.com/lindenb/jvarkit/wiki/Biostar92368

Put all the interactions in a berkeleyDB with key=protein, value=set(interactors)

(...)
        while(r.hasNext())
            {
            String line=r.next();
            String tokens[]=tab.split(line);
            for(int i=0;i< 2;++i)
                {
                String a=tokens[i==0?0:1];
                String b=tokens[i==0?1:0];
                StringBinding.stringToEntry(a, key);
                OperationStatus status=c.getSearchKey(key, data, LockMode.DEFAULT);
                Set<String> partners; 
                if(status==OperationStatus.SUCCESS)
                    {
                    partners=bindingSet.entryToObject(data);

                    }
                else
                    {
                    partners=new HashSet<String>(1);
                    }
                partners.add(b);
                bindingSet.objectToEntry(partners,data);
                if(status==OperationStatus.SUCCESS)
                    {
                    status=c.putCurrent(data);
                    }
                else
                    {
                    status=c.put(key, data);
                    }
                }
            }
(...)

and then recursively scan all the paths to go from protein1 to protein2:

private int recursive(
        Transaction txn,
        final String endProt,
        Stack<String> path,
        int bestDepth,
        final int maxDepth
        )
    {
    DatabaseEntry key=new DatabaseEntry();
    DatabaseEntry data=new DatabaseEntry();
    StringBinding.stringToEntry(path.lastElement(), key);
    if(this.database.get(txn,key,data,null)==OperationStatus.SUCCESS)
        {
        Set<String> partners=this.bindingSet.entryToObject(data);
        for(String prot2:partners)
            {
            int depth=-1;
            if(prot2.equals(endProt))
                {
                depth=path.size()-1;
                }
            else if(path.size()<maxDepth && !path.contains(prot2))
                {
                path.push(prot2);
                depth=recursive(txn,endProt,path,bestDepth,maxDepth);
                path.pop();
                }

            if(depth!=-1 && (bestDepth==-1 || bestDepth>depth))
                {
                bestDepth=depth;
                }
            }
        }
    return bestDepth;
    }

Example:

$ cat input.txt
A1  A2
A2  A3
A3  A4
A4  A5
A1  A6

$ mkdir -p tmp
$  java -jar dist/biostar92368.jar -D tmp input.txt

A1  A2  0
A1  A3  1
A1  A4  2
A1  A6  0
A2  A3  0
A2  A4  1
A2  A5  2
A2  A6  1
A3  A4  0
A3  A5  1
A3  A6  2
A4  A5  0
ADD COMMENT
0
Entering edit mode
10.8 years ago
Arnaud Ceol ▴ 860

The networkx library for python has a method for it: networkx.shortest_path_length(self.network) http://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.shortest_paths.generic.shortest_path.html

ADD COMMENT

Login before adding your answer.

Traffic: 1857 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6