I am trying to replicate the analysis conducted in Brennand et al to cluster microarray expression data (or, in my case, RNA-seq data) with data available in the Allen Brain Atlas. I'm afraid their methods description is not clear enough for me to understand what exactly they did and hence how to repeat it.
To quote the paper (which is open-source, btw):
Cross-platform comparisons of our hiPSC microarray and the AllenBrain Atlas microarrays were done by (1) ranking absolute gene expression for each microarray using Partek, (2) assigning a rank difference value for each gene using a MATLAB script, and (3) calculating Spearman Rank Correlation Coefficients for each microarray comparison in Microsoft Excel. Wilcoxon's rank-sum test was assessed if a category of interest (spatial, temporal or combined) had significantly higher Spearman correlations than the background of all pairwise correlations. No hard cutoffs of 'best matches' were used.
I have tried the following approach (using R):
- Merge data frame with microarray data from ABA with my RNA-seq normalized count table. This gets rid of genes that are not interrogated with the microarray.
- Rank data by absolute gene expression using the rank function (min for ties)
- Use the cor(rankedJune, method="spearman") to calculate spearman rank correlation coefficients
- Identify the Wilcoxon rank-sum test for each of the comparisons -
log(pairwise.wilcox.test(correlationSpear2RelevantMelt$value, correlationSpear2RelevantMelt$variable, p.adjust = "bon")$p.value)
The outcome looks like complete bollocks. What am I doing wrong with my approach, and how do I replicate exactly what the authors have done on my data? (I've also tried doing a conventional spearman between my data and the ABA array, and the datasets (predictably) show that they are quite distinct.
Thanks in advance!
I'm more trying to understand how they did it:
Don't they already have a "rank" followed by a z-score like "rank difference" after step 2 on which they are doing Spearmans???
Possibly they are describing how they assign their rank difference in the supplementary, did they provide some source code? I think that step differs from what you were describing. If they look at rank difference, that's something new, it is not the same as doing ranks, but I haven't read the paper yet. Btw, I just checked, ofc using spearman on ranks gives the same result as using pearson on the same ranks, it just doesn't make a difference. So ignore my "you have to use Pearson", but you still can.
Nope, no source code or additional details!!! That's why it's so frustrating!!!
The problem is that their methods description is very confusing to me, AND they make a claim based on this approach that is very much critical for interpreting the results, especially the biology of their system. I've got a similar system which I'm also trying to "map" to the ABA data, so it makes sense to use their approach, cite them and see what the outcome of the mapping is - but I just can't get it to work!
I think there is not too much magic in their steps, but certainly it shows why using proprietary software does no good in science. You should really get the original data, they should have submitted them to some repository? If not write to the author.
I didn't find further information on the methods they used but:
rank.diff <- rank(x) - rank(y), or abs (rank(x) - rank(y))
with x and y sample intensity vectors, that they need a MATLAB script for that, well...cor(rank.diff.matrix, method='spearman')
I guess that the rank difference is meant as something like a "non-parametric log-ratio". Maybe they should have used the rankprod package in Bioconductor.