Hi everyone,
I am fairly new to bioinformatics and I am a bit stumped on how to go about doing a comparison of my data.
I currently have a file containing about 318 protein clusters. It looks something like this:
Cluster 1 CSF2,NRAS,GSK3A,GSK3B,...
Cluster 2 MAP3K7,HLA-DRA,NFKBIA,ZAP70,...
Cluster 3 CSF2, NRAS, GRIN1, CDKN1A,...
...
I wish to compare the proteins in each cluster against a chosen seed cluster and assign each cluster a similarity score. For example, let's take Cluster 1 as the seed that all others are compared to: if half of the proteins in Cluster 2 also appear in Cluster 1, then Cluster 2 gets a 50% similarity score, and so on through the entire list of clusters. The number of proteins differs from cluster to cluster, so the score should be based on each individual cluster's total number of proteins. The output can be flexible, for example printing all clusters with a score greater than 60%.
Any advice on how I would go about doing something like this in either R or Python would be greatly appreciated.
Thank you,
Adrian
Please provide some more details about your input file. For example, in Cluster 1 and Cluster 2 there are no spaces between gene names, but Cluster 3 has spaces. Is this what the original file really looks like?
Hi, the input file is an Excel file with 2 columns. Columns are delimited by tabs, and gene names by commas.
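Given that description, here is a minimal Python sketch of the scoring described in the question. It assumes the data has been saved as tab-delimited text with the cluster name in the first column and comma-separated gene names in the second; if the file is a real .xlsx workbook, it would need to be exported or read another way first (for example with pandas.read_excel). The function names and the sample rows are illustrative only, and the sample uses just the genes visible in the post, not the truncated remainder. Each score is the percentage of a cluster's own proteins that also occur in the seed cluster.

```python
import csv
import io


def read_clusters(handle):
    """Parse tab-delimited rows into {cluster name: set of gene names}."""
    clusters = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        name = row[0].strip()
        # Strip whitespace so "CSF2, NRAS" and "CSF2,NRAS" parse the same way.
        clusters[name] = {g.strip() for g in row[1].split(",") if g.strip()}
    return clusters


def similarity_to_seed(clusters, seed_name):
    """Percent of each cluster's proteins that also appear in the seed cluster."""
    seed = clusters[seed_name]
    scores = {}
    for name, genes in clusters.items():
        if name == seed_name or not genes:
            continue
        # Denominator is the size of the cluster being scored, not the seed.
        scores[name] = 100.0 * len(genes & seed) / len(genes)
    return scores


def above_threshold(scores, threshold):
    """Keep only clusters whose score is strictly greater than the threshold."""
    return {name: s for name, s in scores.items() if s > threshold}


# Illustrative data only; with a real file, pass open("clusters.tsv", newline="")
# (a hypothetical file name) instead of the StringIO object.
example = (
    "Cluster 1\tCSF2,NRAS,GSK3A,GSK3B\n"
    "Cluster 2\tMAP3K7,HLA-DRA,NFKBIA,ZAP70\n"
    "Cluster 3\tCSF2, NRAS, GRIN1, CDKN1A\n"
)
clusters = read_clusters(io.StringIO(example))
scores = similarity_to_seed(clusters, "Cluster 1")

for name, score in sorted(scores.items()):
    print(f"{name}\t{score:.1f}%")

# Clusters scoring above 60%, as suggested in the question.
hits = above_threshold(scores, 60)
```

Using sets makes the overlap a single intersection and also ignores any gene listed twice within a cluster. Note that the score is not symmetric: comparing Cluster 3 to Cluster 1 divides by the size of Cluster 3, so choosing a different seed can give different percentages for the same pair.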