Hi!
I would greatly appreciate some advice on the quantitative analysis of sample similarity in microarray data.
The Data:
8 different biological samples, all with 3 technical replicates
Illumina WG-6 microarray, quantile normalised and log2 transformed
Multiple probes summarised into one per gene
Initially I did hierarchical clustering on the data matrix, using Euclidean distances and average linkage. I also generated an MDS plot, which shows clear sample separation. Both the dendrogram and the MDS plot show the samples clustering as expected from their tissue of origin. (I'll upload images later)
The aim is to use the global gene expression, as well as a specific preselected subset of genes, to arrive at statements such as: Sample_A is most similar to Sample_C, Sample_D is more similar to Sample_Z, etc. After talking to my good friend Google, two main approaches seem to be used: correlation and distance.
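For reference, here is a minimal sketch of what I mean by average linkage. It is not the code I actually ran (that worked on the full data matrix); it only uses the averaged between-sample Euclidean distances for Samples 1 to 6, rounded from the list further down in this post. The function names are my own, made up for illustration.

```python
from itertools import combinations

# Averaged between-sample Euclidean distances for Samples 1-6,
# rounded from the Euclidean list further down in this post.
DIST = {
    ("1", "2"): 23.22, ("1", "3"): 34.83, ("2", "3"): 36.55,
    ("4", "6"): 35.14, ("5", "6"): 46.86, ("4", "5"): 47.83,
    ("1", "4"): 88.83, ("1", "5"): 86.31, ("1", "6"): 84.95,
    ("2", "4"): 89.01, ("2", "5"): 86.75, ("2", "6"): 85.03,
    ("3", "4"): 92.50, ("3", "5"): 89.16, ("3", "6"): 88.33,
}

def pair_distance(a, b):
    """Look up the distance between two samples regardless of key order."""
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]

def average_linkage(labels):
    """Agglomerative clustering; cluster distance = mean of all cross-pair distances."""
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            left, right = pair
            cross = [pair_distance(a, b) for a in left for b in right]
            return sum(cross) / len(cross)
        # Merge the two clusters with the smallest average distance.
        left, right = min(combinations(clusters, 2), key=linkage)
        merges.append((left, right, round(linkage((left, right)), 2)))
        clusters = [c for c in clusters if c not in (left, right)] + [left + right]
    return merges

merges = average_linkage(["1", "2", "3", "4", "5", "6"])
for left, right, height in merges:
    print(left, "+", right, "at", height)
```

On these numbers Samples 1, 2 and 3 end up in one cluster and Samples 4, 5 and 6 in the other, which matches what I see in the dendrogram.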
So these are the analysis steps I performed
1) calculate Spearman's correlation / Euclidean distance between each pair of sample replicates
2) average over the replicates (to get a better overview of the data)
3) rank/order the resulting values
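To make the three steps concrete, here is a minimal sketch in plain Python. The toy expression values and the sample names Sample_X / Sample_Y are hypothetical and only there so the example runs; the helper names are mine, and in practice one would probably use a statistics package rather than hand-written functions.

```python
import math
from itertools import combinations_with_replacement
from statistics import mean

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def pairwise_summary(replicates, measure):
    """Steps 1-2: score every replicate pair, then average per sample pair."""
    summary = {}
    for a, b in combinations_with_replacement(sorted(replicates), 2):
        scores = [
            measure(x, y)
            for i, x in enumerate(replicates[a])
            for j, y in enumerate(replicates[b])
            if a != b or i < j  # within a sample, skip self-pairs and duplicates
        ]
        summary[(a, b)] = mean(scores)
    return summary

# Hypothetical toy data: 2 samples x 3 replicates x 5 genes (log2 scale).
toy = {
    "Sample_X": [[7.1, 8.0, 9.2, 6.5, 10.1], [7.0, 8.1, 9.0, 6.6, 10.3], [7.2, 7.9, 9.1, 6.4, 10.0]],
    "Sample_Y": [[9.5, 6.2, 7.8, 8.9, 7.0], [9.4, 6.0, 7.9, 9.1, 7.2], [9.6, 6.3, 7.7, 9.0, 6.9]],
}

euclid = pairwise_summary(toy, math.dist)
rho = pairwise_summary(toy, spearman)

# Step 3: rank sample pairs, most similar first.
by_distance = sorted(euclid, key=euclid.get)
by_correlation = sorted(rho, key=rho.get, reverse=True)
print(by_distance)
print(by_correlation)
```

Note that the within-sample entries (e.g. Sample_X vs Sample_X) are averages over the three replicate-to-replicate comparisons, which is why they are not 0 for distance or 1 for correlation in my lists.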
The lists produced by the two methods are shown below.
Euclidean Distances
Sample_1_vs_Sample_1 8.003162633
Sample_B_vs_Sample_B 8.111607651
Sample_6_vs_Sample_6 8.51449045
Sample_4_vs_Sample_4 8.684158695
Sample_5_vs_Sample_5 9.022024966
Sample_3_vs_Sample_3 9.349750723
Sample_2_vs_Sample_2 9.903293889
Sample_A_vs_Sample_A 9.966555661
Sample_1_vs_Sample_2 23.21577641
Sample_1_vs_Sample_3 34.83212106
Sample_4_vs_Sample_6 35.14049658
Sample_2_vs_Sample_3 36.54938163
Sample_5_vs_Sample_6 46.86066654
Sample_4_vs_Sample_5 47.83052274
Sample_1_vs_Sample_6 84.95058878
Sample_2_vs_Sample_6 85.03191301
Sample_1_vs_Sample_5 86.31027616
Sample_2_vs_Sample_5 86.74889618
Sample_3_vs_Sample_6 88.33253675
Sample_1_vs_Sample_4 88.8302554
Sample_2_vs_Sample_4 89.010459
Sample_3_vs_Sample_5 89.15878373
Sample_3_vs_Sample_4 92.50254662
Sample_B_vs_Sample_5 94.73304572
Sample_B_vs_Sample_6 96.26289691
Sample_B_vs_Sample_4 97.0321506
Sample_1_vs_Sample_B 98.91472002
Sample_2_vs_Sample_B 99.60447516
Sample_3_vs_Sample_B 100.3718217
Sample_A_vs_Sample_6 145.4080426
Sample_A_vs_Sample_1 147.2187797
Sample_A_vs_Sample_4 147.3384896
Sample_A_vs_Sample_5 147.519752
Sample_A_vs_Sample_3 147.7770987
Sample_A_vs_Sample_2 147.8183657
Sample_A_vs_Sample_B 156.9427804
Spearman's correlation
Sample_B_vs_Sample_B 0.974732385
Sample_4_vs_Sample_4 0.963556203
Sample_6_vs_Sample_6 0.958113935
Sample_1_vs_Sample_1 0.957711204
Sample_5_vs_Sample_5 0.957584535
Sample_2_vs_Sample_2 0.956886863
Sample_3_vs_Sample_3 0.953256139
Sample_A_vs_Sample_A 0.943146642
Sample_1_vs_Sample_2 0.928978596
Sample_4_vs_Sample_6 0.924040013
Sample_2_vs_Sample_3 0.916858011
Sample_1_vs_Sample_3 0.916702866
Sample_4_vs_Sample_5 0.912913506
Sample_5_vs_Sample_6 0.90990687
Sample_1_vs_Sample_6 0.855269466
Sample_1_vs_Sample_5 0.854439331
Sample_1_vs_Sample_4 0.853748338
Sample_2_vs_Sample_6 0.852426371
Sample_2_vs_Sample_5 0.851438395
Sample_2_vs_Sample_4 0.851191735
Sample_3_vs_Sample_5 0.840621555
Sample_3_vs_Sample_6 0.840232229
Sample_3_vs_Sample_4 0.8374603
Sample_1_vs_Sample_B 0.835720538
Sample_2_vs_Sample_B 0.835678178
Sample_B_vs_Sample_6 0.835516264
Sample_B_vs_Sample_4 0.834379338
Sample_B_vs_Sample_5 0.830327553
Sample_3_vs_Sample_B 0.82501908
Sample_A_vs_Sample_1 0.750257217
Sample_A_vs_Sample_2 0.748220867
Sample_A_vs_Sample_6 0.743813385
Sample_A_vs_Sample_3 0.743707556
Sample_A_vs_Sample_5 0.743260374
Sample_A_vs_Sample_4 0.741759797
Sample_A_vs_Sample_B 0.714091288
There are some discrepancies between the two lists, but generally they are pretty close.
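To put a number on "pretty close", one option is to compare the rank order of the sample pairs under the two measures. The sketch below does this for the 15 between-sample pairs among Samples 1 to 6, with values rounded from the two lists above. The shortcut formula for Spearman's rho assumes there are no ties, which holds for these values; the helper name is my own.

```python
# pair -> (averaged Euclidean distance, averaged Spearman correlation),
# rounded from the two lists above (Samples 1-6 only).
PAIRS = {
    "1_vs_2": (23.22, 0.9290), "1_vs_3": (34.83, 0.9167), "4_vs_6": (35.14, 0.9240),
    "2_vs_3": (36.55, 0.9169), "5_vs_6": (46.86, 0.9099), "4_vs_5": (47.83, 0.9129),
    "1_vs_6": (84.95, 0.8553), "2_vs_6": (85.03, 0.8524), "1_vs_5": (86.31, 0.8544),
    "2_vs_5": (86.75, 0.8514), "3_vs_6": (88.33, 0.8402), "1_vs_4": (88.83, 0.8537),
    "2_vs_4": (89.01, 0.8512), "3_vs_5": (89.16, 0.8406), "3_vs_4": (92.50, 0.8375),
}

def rank_positions(scores, reverse=False):
    """Map each pair to its 1-based position when sorted by score (assumes no ties)."""
    ordered = sorted(scores, key=scores.get, reverse=reverse)
    return {pair: position for position, pair in enumerate(ordered, start=1)}

# Most similar first under each measure: small distance, high correlation.
distance_rank = rank_positions({p: v[0] for p, v in PAIRS.items()})
correlation_rank = rank_positions({p: v[1] for p, v in PAIRS.items()}, reverse=True)

# Positive shift = the pair looks less similar by distance than by correlation.
shift = {p: distance_rank[p] - correlation_rank[p] for p in PAIRS}

# Spearman's rho between the two orderings (no-ties shortcut formula).
n = len(PAIRS)
agreement = 1 - 6 * sum(d * d for d in shift.values()) / (n * (n * n - 1))

print(f"rank agreement between the two orderings: {agreement:.3f}")
for pair in sorted(shift, key=lambda p: abs(shift[p]), reverse=True)[:3]:
    print(pair, "distance rank", distance_rank[pair], "correlation rank", correlation_rank[pair])
```

For these 15 pairs the agreement comes out at roughly 0.94, and no pair moves by more than three positions (Sample_3_vs_Sample_6 and Sample_1_vs_Sample_4 move the most).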
So my questions essentially are: are my methods statistically justifiable?
Is this the way in which this type of analysis is generally done?
How do I consolidate the data from the two similarity measures?
I have also come across the "correlation of correlations" method, in which gene-wise correlations are calculated first and the resulting values are then used to calculate the correlation between samples, as described in these papers:
Russ & Futschik - Comparison and consolidation of microarray data sets of human tissue expression
Zheng-Bradley, et al - Large scale comparison of global gene expression patterns in human and mouse
Cope et al - MergeMaid (R implementation in intCor function)
This corCor, or integrative correlation coefficient (IGC) as it is also called, seems to be used mainly for comparisons across different species, different microarray data sets or different studies. However, I am wondering whether it would also be appropriate for my data, or whether it would be overkill.
Any comments, guidelines or advice are greatly appreciated!
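To show what I have in mind for a single data set, here is a rough sketch of a second-order similarity: two samples are compared through their correlation profiles against all the other samples. This is only my own reading of how the idea might transfer to my data; it is not the intCor function from MergeMaid and not necessarily the procedure used in the papers above. The correlations are the averaged between-sample Spearman values for Samples 1 to 6, rounded from my list.

```python
import math
from statistics import mean

# Averaged between-sample Spearman correlations for Samples 1-6 (rounded from my list).
CORR = {
    ("1", "2"): 0.9290, ("1", "3"): 0.9167, ("2", "3"): 0.9169,
    ("4", "6"): 0.9240, ("4", "5"): 0.9129, ("5", "6"): 0.9099,
    ("1", "4"): 0.8537, ("1", "5"): 0.8544, ("1", "6"): 0.8553,
    ("2", "4"): 0.8512, ("2", "5"): 0.8514, ("2", "6"): 0.8524,
    ("3", "4"): 0.8375, ("3", "5"): 0.8406, ("3", "6"): 0.8402,
}
SAMPLES = ["1", "2", "3", "4", "5", "6"]

def corr(a, b):
    """First-order correlation between two samples, regardless of key order."""
    return CORR[(a, b)] if (a, b) in CORR else CORR[(b, a)]

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    return cov / math.sqrt(sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y))

def second_order(a, b):
    """Correlate the correlation profiles of a and b over all other samples."""
    others = [s for s in SAMPLES if s not in (a, b)]
    return pearson([corr(a, s) for s in others], [corr(b, s) for s in others])

print("1 vs 2:", round(second_order("1", "2"), 3))
print("1 vs 4:", round(second_order("1", "4"), 3))
```

With only a handful of samples the profiles are very short, so the values jump to near +1 or -1 very quickly. That is part of why I am unsure whether this adds anything beyond the plain correlation for my data.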