Protein Cluster similarity score as a correlation matrix
7.7 years ago
adi0957 ▴ 10

I wish to compare the proteins in each cluster and assign a similarity score to every pair of clusters. For example, Cluster 1 compared with itself would score 1, Cluster 1 compared with Cluster 2 might score 0.7, and so on. The number of proteins differs between clusters, so the score should take each cluster's total number of proteins into account. The output should preferably be a similarity matrix, something like this:

Input:

Cluster 1   CSF2,NRAS,GSK3A,GSK3B 
Cluster 2   MAP3K7,HLA-DRA,NFKBIA,ZAP70 
Cluster 3   CSF2,NRAS,GRIN1,CDKN1A 
Cluster 4   GSK3A,GSK3B,NRAS,CSF2

Output:

           Cluster 1      Cluster 2      Cluster 3       Cluster 4
Cluster 1      1             0             0.33             1
Cluster 2      0             1             0                0
Cluster 3     0.33           0             1                0.33
Cluster 4      1             0             0.33             1

Any help or advice would be greatly appreciated, thank you.

alignment gene • 2.3k views

What is your level of bioinformatics experience? Do you know how to use a double for loop in R, for instance? Do you know how to use %in% in R? Or are you a novice?

It sounds a bit like a function I wrote for the R package gogadget 2.0 (gogadget: an R package for GO analysis visualization and interpretation), namely gogadget.overlap. This function counts the number of genes that overlap between GO terms, calculates an overlap index from those counts, and visualizes the result in a heatmap.

Take a look at the R package at https://sourceforge.net/projects/gogadget/ if you have some bioinformatics skills. If you are a novice, I advise you to get help from a bioinformatician near you...
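
For readers who do not use R, the double-loop overlap count described above can be sketched in Python roughly as follows. This is not the gogadget.overlap implementation: the function name `overlap_counts` and the dict-of-lists input are assumptions made for illustration, and the package's overlap index is not reproduced here.

```python
def overlap_counts(clusters):
    """Count shared members for every ordered pair of clusters (hypothetical helper)."""
    names = list(clusters)
    counts = {}
    for a in names:          # outer loop over clusters
        for b in names:      # inner loop over clusters
            # Membership test, similar in spirit to R's %in%
            counts[(a, b)] = sum(1 for gene in clusters[a] if gene in clusters[b])
    return counts


example = {
    "Cluster 1": ["CSF2", "NRAS", "GSK3A", "GSK3B"],
    "Cluster 3": ["CSF2", "NRAS", "GRIN1", "CDKN1A"],
}
counts = overlap_counts(example)
print(counts[("Cluster 1", "Cluster 3")])  # 2 (CSF2 and NRAS are shared)
```

The raw counts still need to be normalized by cluster size to give a score between 0 and 1; how to normalize is a design choice.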


I am a student and only recently started delving into bioinformatics, so I am still a novice.


Okay, in that case I would suggest you learn some coding skills first. I don't think it would be helpful to write the code for you (you'll learn nothing from that). Good luck.


Was this not answered in a recent thread, "Protein name alignment for comparison and similarity score"?


Hi, yes it was, but the output was a bit different. The person who provided the original answer suggested I open a new question where he could provide a solution. Thanks.

7.7 years ago

In Python:

Script (script.py):

from itertools import combinations_with_replacement
import sys

# Reading data: one cluster per line, "<cluster name><TAB><comma-separated proteins>"
lst = []
with open(sys.argv[1]) as fh:
    for line in fh:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sl = line.split('\t')
        cname = sl[0]
        cset = set(sl[1].split(','))
        lst.append((cname, cset))

# Calculation: score = |intersection| / |union| (the Jaccard index) for every pair
n = len(lst)
data = [[0.0] * n for _ in range(n)]
for i, j in combinations_with_replacement(range(n), 2):
    s1 = lst[i][1]
    s2 = lst[j][1]
    score = len(s1.intersection(s2)) / float(len(s1.union(s2)))
    data[i][j] = score
    data[j][i] = score  # the matrix is symmetric

# Printing data as a tab-separated matrix
header = "\t" + "\t".join([cname for cname, _ in lst])
print(header)
for i, row in enumerate(data):
    rowstr = lst[i][0] + "\t"
    rowstr += "\t".join(["{:.2f}".format(val) for val in row])
    print(rowstr)

Run:

python script.py clusters.txt

Output:

            Cluster 1   Cluster 2   Cluster 3   Cluster 4
Cluster 1   1.00        0.00        0.33        1.00
Cluster 2   0.00        1.00        0.00        0.00
Cluster 3   0.33        0.00        1.00        0.33
Cluster 4   1.00        0.00        0.33        1.00
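
The score computed by the script is the size of the intersection divided by the size of the union, which is the Jaccard index. As a quick check of the numbers, here is a minimal self-contained sketch that uses the four clusters from the question directly instead of reading a file. The helper name `jaccard` and the choice to return 0.0 for two empty sets are assumptions for illustration, and the code assumes Python 3 division.

```python
def jaccard(a, b):
    """Jaccard index: shared members divided by all distinct members."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty clusters score 0.0 rather than raising an error
    return len(a & b) / len(union)


clusters = {
    "Cluster 1": {"CSF2", "NRAS", "GSK3A", "GSK3B"},
    "Cluster 2": {"MAP3K7", "HLA-DRA", "NFKBIA", "ZAP70"},
    "Cluster 3": {"CSF2", "NRAS", "GRIN1", "CDKN1A"},
    "Cluster 4": {"GSK3A", "GSK3B", "NRAS", "CSF2"},
}

names = list(clusters)
matrix = {a: {b: jaccard(clusters[a], clusters[b]) for b in names} for a in names}

# Cluster 1 and Cluster 3 share CSF2 and NRAS: 2 proteins out of 6 distinct ones.
print(round(matrix["Cluster 1"]["Cluster 3"], 2))  # 0.33
```

Because the union accounts for the sizes of both clusters, the score stays between 0 and 1 even when the clusters contain different numbers of proteins.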

Don't you just love python for this! Thing of beauty!


Absolutely brilliant! Thank you so much. Hopefully I'll be able to do this kind of stuff by myself in the near future.
