Question

Protein Cluster similarity score as a correlation matrix

7.7 years ago

adi0957 ▴ 10

I want to compare the proteins in each cluster and assign a similarity score to every pair of clusters. For example, Cluster 1 compared to itself would score 1, Cluster 1 compared to Cluster 2 might score 0.7, and so on. The clusters contain different numbers of proteins, so the score should take each cluster's total number of proteins into account. The output should preferably be a similarity matrix, something like this:

Input:

Cluster 1   CSF2,NRAS,GSK3A,GSK3B 
Cluster 2   MAP3K7,HLA-DRA,NFKBIA,ZAP70 
Cluster 3   CSF2,NRAS,GRIN1,CDKN1A 
Cluster 4   GSK3A,GSK3B,NRAS,CSF2

Output:

           Cluster 1      Cluster 2      Cluster 3       Cluster 4
Cluster 1      1             0             0.33             1
Cluster 2      0             1             0                0
Cluster 3     0.33           0             1                0.33
Cluster 4      1             0             0.33             1

Any help or advice would be greatly appreciated, thank you.

alignment gene • 2.3k views

ADD COMMENT • link updated 7.7 years ago by Andrzej Zielezinski 11k • written 7.7 years ago by adi0957 ▴ 10

What is your bioinformatics level? Do you know how to use a double for loop in R, for instance? Do you know how to use %in% in R? Or are you a novice?

It sounds a bit like a function I wrote for the R package gogadget 2.0 (gogadget: an R package for GO analysis visualization and interpretation): gogadget.overlap. This function counts the number of genes that overlap between GO terms, calculates an overlap index from those counts, and visualizes it in a heatmap.

Take a look at the R package (https://sourceforge.net/projects/gogadget/) if you have some bioinformatics skills. If you are a novice, I advise you to get help from a bioinformatician near you.
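As a rough illustration of the counting step described above (this is not the gogadget code, which is written in R), the pairwise overlap counts could be sketched in Python with a double loop and a membership test. The cluster names and gene lists come from the question; `overlap_counts` is a hypothetical helper name. The overlap index formula is not given in this thread, so only the raw counts are computed.

```python
# Example input copied from the question.
clusters = {
    "Cluster 1": ["CSF2", "NRAS", "GSK3A", "GSK3B"],
    "Cluster 2": ["MAP3K7", "HLA-DRA", "NFKBIA", "ZAP70"],
    "Cluster 3": ["CSF2", "NRAS", "GRIN1", "CDKN1A"],
    "Cluster 4": ["GSK3A", "GSK3B", "NRAS", "CSF2"],
}


def overlap_counts(clusters):
    """Hypothetical helper: return {(name_a, name_b): number of shared genes}."""
    counts = {}
    for name_a, genes_a in clusters.items():        # outer loop
        for name_b, genes_b in clusters.items():    # inner loop
            # `g in genes_b` plays the role of R's %in%
            shared = [g for g in genes_a if g in genes_b]
            counts[(name_a, name_b)] = len(shared)
    return counts


counts = overlap_counts(clusters)
print(counts[("Cluster 1", "Cluster 3")])  # 2 (CSF2 and NRAS)
```

Turning these counts into a score still requires choosing a denominator, for example the size of the union of the two clusters.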

ADD REPLY • link 7.7 years ago by Benn 8.3k

I am a student and only recently started delving into bioinformatics, so I am still a novice.

ADD REPLY • link 7.7 years ago by adi0957 ▴ 10

Okay, in that case I would suggest you learn some coding skills first. I don't think it would be helpful to write the code for you (you'll learn nothing from that). Good luck.

ADD REPLY • link 7.7 years ago by Benn 8.3k

Was this not answered in a recent thread: Protein name alignment for comparison and similarity score?

ADD REPLY • link 7.7 years ago by GenoMax 147k

Hi, yes it was, but the output was a bit different. The person who provided the original answer suggested I open a new question where he could provide a solution. Thanks.

ADD REPLY • link 7.7 years ago by adi0957 ▴ 10

score 6 · Accepted Answer · 2017-03-31

In Python:

Script (script.py):

from itertools import combinations_with_replacement
import sys

with open(sys.argv[1]) as fh:
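    # Reading data: one cluster per line, "name<TAB>comma-separated proteins"
    lst = []
    for line in fh:
        line = line.strip()
        if not line:
            # Skip blank lines (e.g. an empty line at the end of the file)
            continue
        sl = line.split('\t')
        cname = sl[0]
        cset = set(sl[1].split(','))
        lst.append((cname, cset))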

    # Calculation
    n = len(lst)
    data = [[0.0] * n for _ in range(n)]
    for i, j in combinations_with_replacement(range(n), 2):
        s1 = lst[i][1]
        s2 = lst[j][1]
        score = len(s1.intersection(s2))/float(len(s1.union(s2)))
        data[i][j] = score
        data[j][i] = score

    # Printing data
    header = "\t" + "\t".join([cname for cname, _ in lst])
    print(header)
    for i, row in enumerate(data):
        rowstr = lst[i][0]+"\t"
        rowstr += "\t".join(["{:.2f}".format(val) for val in row])
        print(rowstr)

Run:

python script.py clusters.txt

Output:

    Cluster 1   Cluster 2   Cluster 3   Cluster 4
Cluster 1   1.00    0.00    0.33    1.00
Cluster 2   0.00    1.00    0.00    0.00
Cluster 3   0.33    0.00    1.00    0.33
Cluster 4   1.00    0.00    0.33    1.00
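
The score computed in the script is the Jaccard index: the number of proteins shared by two clusters divided by the number of distinct proteins across both. A minimal sketch that checks single pairs from the example input by hand (the `jaccard` helper is a hypothetical name, not part of the script above, and returning 0.0 for two empty clusters is an assumption made here to avoid a division by zero):

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty clusters score 0.0 instead of raising an error
        return 0.0
    return len(a & b) / len(union)


c1 = ["CSF2", "NRAS", "GSK3A", "GSK3B"]
c3 = ["CSF2", "NRAS", "GRIN1", "CDKN1A"]
c4 = ["GSK3A", "GSK3B", "NRAS", "CSF2"]

# 2 shared proteins (CSF2, NRAS) out of 6 distinct ones
print(round(jaccard(c1, c3), 2))  # 0.33
# Same members listed in a different order
print(jaccard(c1, c4))            # 1.0
```

Because the denominator is the union of both clusters, the score is symmetric and accounts for clusters of different sizes, which is why the matrix mirrors across the diagonal.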