I have datasets containing mutations for 26 samples (so 26 different sets of data) and I want to compare how similar they are to each other. So for example, I would like to compare Sample 1 against Samples 1 to 26, and do the same for each of the 26 samples. At the end, I hope to get something like a heat map where the first row and first column list the 26 samples.
Someone has suggested using the intersect and union functions in R to calculate the similarity, but that would be very laborious as I would have to run the functions 676 times (26*26).
Is there any program to do this quickly or is there a way that I could make this more efficient in R?
What is your data, i.e. how is each sample represented? What kind of similarity are you looking for, i.e. how do you define similarity between two samples? Note that running a function ~700 times is not a big deal (in R or any other language) unless each run takes days.
I guess you are looking for a correlation map, not a heatmap. A correlation map compares all samples against all samples (26 x 26).
If the variables are continuous and the expected trend is linear, then you can use correlation. There are alternative distance measures which you can also use to get a better idea of how closely two variables are related to one another. You should be careful which clustering methods you use, though, because they have statistical assumptions which need to be met in order for them to be used properly.
Thanks for the replies! My data is in a single column and contains information about types of SNPs in the format Chr1_Pos_A_T for example. This is all stored in a single column in a text file.
So for each sample you have a list of SNP positions. If the similarity you're after is about the fraction of SNPs two samples share, then you could use the Jaccard index or any other measure of similarity between sets. As you've been told, you can use the R intersect() and union() functions or convert your data to binary vectors and use similarity functions from the proxy package. However, if you have very large numbers of SNPs, it is possible that the similarities become meaningless due to the distance concentration phenomenon.
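To illustrate the set-based approach, here is a minimal sketch in Python rather than R (the same logic maps onto intersect() and union()). It assumes each sample has already been read into a set of SNP strings in the Chr_Pos_Ref_Alt format; the sample names and SNPs below are made up for the example, and the helper names jaccard and jaccard_matrix are hypothetical, not from any package. Because the Jaccard index is symmetric, 26 samples need only 325 pairwise comparisons, not 676.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets.

    Assumption for this sketch: two empty sets count as identical (1.0).
    """
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def jaccard_matrix(samples):
    """Return (names, matrix), where matrix[i][j] is the Jaccard index
    between samples names[i] and names[j]."""
    names = list(samples)
    n = len(names)
    # Every sample is identical to itself, so start with 1.0 everywhere
    # and fill in the off-diagonal entries.
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        score = jaccard(samples[names[i]], samples[names[j]])
        matrix[i][j] = score
        matrix[j][i] = score  # the index is symmetric
    return names, matrix


# Toy data (made up): one set of SNP identifiers per sample.
samples = {
    "S1": {"Chr1_100_A_T", "Chr1_200_G_C", "Chr2_50_T_G"},
    "S2": {"Chr1_100_A_T", "Chr2_50_T_G"},
    "S3": {"Chr3_10_C_A"},
}

names, matrix = jaccard_matrix(samples)
for name, row in zip(names, matrix):
    print(name, " ".join(f"{value:.2f}" for value in row))
```

The resulting matrix can be passed to any heatmap function. With real data, each set would be built from the lines of the corresponding single-column text file.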