Hello,
In my analysis, I selected correlated gene pairs in TCGA using the many samples available there, and found an interesting trend among the correlated genes.
Now we want to show the same trend in our own data, so we conducted an experiment with two replicates.
I first need to identify correlated gene pairs from the experiment; for example, I want to know whether geneA, whose expression in the two replicates is (10, 20), is correlated with geneB, whose expression is (20, 30).
But if you feed these values to cor.test, for example, it refuses to run because there are too few observations.
If any of you have done something like this before (calculating correlations from a small number of observations), can you please point me to the literature or show me how to do this?
Actually, I was thinking of sort of bootstrapping the values; for example, feeding geneA expression as (10, 20, 10, 20) and geneB expression as (20, 30, 20, 30) to Pearson's r. I believe there should be an analysis of something like this out there that studies its power and such, but I haven't been able to find one so far. Can you please let me know if you know of one?
Thanks always!
Michael,
Thank you for the cautions.
Yes, I agree that your advice is something we always need to keep in mind.
It's just that I simplified my description in order to ask the question.
My replicates are biological replicates. After identifying correlated gene pairs, I conduct additional steps with input from biologists, and the trend in TCGA has been verified with biologists.
The step of identifying correlated pairs is particularly difficult in the experimental validation, because we have only two replicates.
But again, thanks for the caution, and I will keep that in mind.
HJ.
The easiest way to solve your problem might be to add more replicates so that you can calculate a sensible correlation. Also, I think there is some confusion about the experimental design. If your replicates are biological replicates, then it is hard to believe that there is interpretable biology behind the sample correlation. On the other hand, if these samples are from different patients, then the correlation might simply indicate that each patient has her own characteristic gene expression pattern that overlays the cancer-specific expression pattern. The problem then is that the patient perspective changes the experimental design completely into looking at a per-patient effect, an effect for which there is no replication.
As a side note, I would also be cautious about "input from biologists" and trends verified by biologists, because biologists might come up with reasonable stories for whatever results are presented (aka telling the biological story) ;)
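To illustrate why two observations cannot support a correlation estimate, here is a minimal sketch in plain Python (the thread itself refers to R's cor.test; the helper pearson_r and the variable names are made up for illustration). With only two points per gene, Pearson's r is +1 or -1 whenever both genes change at all, and repeating the same two points leaves r unchanged while only inflating the apparent sample size.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


gene_a = [10, 20]
gene_b = [20, 30]

# Two observations: a straight line always fits two points perfectly
r_two = pearson_r(gene_a, gene_b)

# Duplicating the observations, as proposed in the question, changes nothing
r_dup = pearson_r(gene_a * 2, gene_b * 2)

# A gene that barely changes in the same direction gives the same r
r_small_change = pearson_r(gene_a, [20, 21])

# A gene that moves in the opposite direction gives r = -1
r_opposite = pearson_r(gene_a, [30, 20])

print(r_two, r_dup, r_small_change, r_opposite)
```

Since every pair of non-constant genes gets r = +1 or -1 here, the statistic cannot rank gene pairs. The copied values are also not independent observations, so any p-value computed from them would overstate the evidence.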
Good answer +1
Michael,
Now I see how I confused you.
It is not sample correlation, it is to find correlated genes over samples (of a particular type of cancer).
But again, I agree with you that there can be all kinds of crazy errors, biases, or batch effects, so we must be very cautious at every step.
Anyway, thanks to your other comment, I take it that calculating a correlation doesn't make sense for two observations, and I will probably have to try to add more replicates.
Thanks a lot!
HJ.
And another word of caution (because it's fun): you need to correct for multiple testing! When testing correlations among 10k genes or more, you are dealing with on the order of 10^8 comparisons of gene pairs. At the 0.05 significance level, that means one would expect about 5 million gene pairs to come up just by chance. With Bonferroni adjustment, you would need to see p-values as low as 5e-10 for a pair to be significant.
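The arithmetic behind these numbers can be checked with a short Python sketch (variable names are illustrative, and the exact pair count assumes each unordered gene pair is tested once). The comment rounds to 10^8 comparisons; counting unordered pairs of exactly 10,000 genes gives roughly half of that, which is the same order of magnitude.

```python
n_genes = 10_000
alpha = 0.05

# Order-of-magnitude figure used in the comment above
n_tests_rounded = 10 ** 8
expected_by_chance_rounded = alpha * n_tests_rounded  # about 5 million
bonferroni_rounded = alpha / n_tests_rounded          # about 5e-10

# Exact number of unordered gene pairs: n * (n - 1) / 2
n_pairs_exact = n_genes * (n_genes - 1) // 2
expected_by_chance_exact = alpha * n_pairs_exact
bonferroni_exact = alpha / n_pairs_exact

print(f"rounded: {n_tests_rounded:.0e} tests, "
      f"{expected_by_chance_rounded:.3g} expected by chance, "
      f"Bonferroni threshold {bonferroni_rounded:.3g}")
print(f"exact:   {n_pairs_exact} tests, "
      f"{expected_by_chance_exact:.3g} expected by chance, "
      f"Bonferroni threshold {bonferroni_exact:.3g}")
```

Either way, the per-pair threshold is around 1e-9 or below, far stricter than 0.05.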
Michael,
I appreciate your kind comments.
Yes, many details were buried under the simplification (which I now regret), and I did apply a multiple-testing correction.
But many thanks, it is always important to check details (and yes it is fun!).
HJ.