Question

Correlation of two genes on 2 observations

0

10.2 years ago

thejustpark ▴ 80

Hello,

In my analysis, I selected correlated gene pairs in TCGA using the many samples available there, and found an interesting trend among the correlated genes.

Now that we want to show the trend using our data, we conducted an experiment with two replicates.

I first need to identify correlated gene pairs from the experiment; for example, I want to know if geneA, whose expression in the two replicates is (10, 20), is correlated with geneB, whose expression is (20, 30).

But if you feed these values to cor.test, for example, it fails because there are too few observations.

If any of you have done something like this before (calculating correlations using a small number of observations), can you please point me to the literature or show me how to do this?

Actually, I was thinking of kinda bootstrapping the values; for example, feeding geneA expression as (10, 20, 10, 20) and geneB expression as (20, 30, 20, 30) to Pearson's r. I believe there should be an analysis out there that studies the power of something like this, but I haven't found one so far. Can you please let me know if you know of one?

Thanks always!

correlation gene-expression • 3.2k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by thejustpark ▴ 80

0

Michael,

Thank you for the cautions.

Yes, I agree that your advice is something we always need to keep in mind.

It's just that I simplified my description in order to ask the question.

My replicates are biological replicates, and I conduct additional steps after identifying correlated gene pairs with input from biologists, and the trend in TCGA is verified with biologists.

It's just that the step of identifying correlated pairs is particularly difficult in the experimental validation step, because we have only two replicates.

But again, thanks for the caution, and I will keep that in mind.

HJ.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by thejustpark ▴ 80

2

The easiest way to solve your problem might be to add more replicates so that a sensible correlation can be calculated. Also, I think there is some confusion about the experimental design. If your replicates are biological replicates, then it is hard to believe that there is interpretable biology behind sample correlation. On the other hand, if these samples are from different patients, then the correlation might simply indicate that each patient has her own characteristic gene expression pattern that overlays the cancer-specific expression pattern. The problem then is that the patient perspective changes the experimental design completely into looking at a per-patient effect, an effect for which there is no replication.

As a side note, I would also be cautious about "input from biologists" and trends verified by biologists, because biologists might come up with reasonable stories for whatever results are presented (aka telling the biological story) ;)

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by Michael 55k

0

Good answer +1

ADD REPLY • link 10.2 years ago by PoGibas 5.1k

0

Michael,

Now I see how I confused you.

It is not sample correlation, it is to find correlated genes over samples (of a particular type of cancer).

But again, I agree with you that there can be all kinds of crazy errors, biases, or batch effects, so we must be very cautious at every step.

Anyway, thanks to your other comment, I take it that calculating correlation doesn't make sense for two observations, and I will probably have to add more replicates.

Thanks a lot!

HJ.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by thejustpark ▴ 80

0

And another word of caution (because it's fun): you need to correct for multiple testing! When testing correlations among 10k genes or more, you are dealing with on the order of 10^8 gene-pair comparisons. At the 0.05 significance level, that means one would expect about 5 million gene pairs to come up just by chance. With Bonferroni adjustment, you would need to see p-values as low as 5e-10 for significance.
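The arithmetic behind those figures can be checked with a short sketch. It assumes exactly 10,000 genes and a 0.05 threshold; the variable names are made up for illustration. The exact count of unordered pairs is about 5 x 10^7, while the 10^8 figure corresponds to the rounder n^2 order of magnitude.

```python
# Rough multiple-testing arithmetic for all-vs-all gene correlations.
# Assumptions: 10,000 genes, significance level 0.05 (illustrative values).
n_genes = 10_000
alpha = 0.05

# Exact number of unordered gene pairs: n * (n - 1) / 2
n_pairs_exact = n_genes * (n_genes - 1) // 2

# Order-of-magnitude figure (n^2), as used in the comment above
n_pairs_rough = n_genes ** 2

for label, n_pairs in [("exact", n_pairs_exact), ("rough", n_pairs_rough)]:
    expected_by_chance = alpha * n_pairs   # pairs expected to pass by chance alone
    bonferroni_cutoff = alpha / n_pairs    # per-test p-value needed after Bonferroni
    print(f"{label}: {n_pairs:.3g} pairs, "
          f"{expected_by_chance:.3g} expected by chance, "
          f"Bonferroni cutoff {bonferroni_cutoff:.3g}")
```

Either way, the cutoff lands around 1e-9 or below, which is far stricter than anything a handful of replicates can support.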

ADD REPLY • link 10.2 years ago by Michael 55k

0

Michael,

I appreciate your kind comments.

Yes, many details were buried under the simplification (which I now regret), and I did apply multiple-testing correction.

But many thanks, it is always important to check details (and yes it is fun!).

HJ.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by thejustpark ▴ 80

Ram · Answer 1 · 2014-09-08

When you look at only two samples, calculating correlation doesn't make sense, because it is always either 1, or -1 (or NA in case of zero standard deviation), even if both values are simply random.
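To make this concrete, here is a minimal sketch using the values from the question. The helper `pearson_r` is written out by hand for illustration (it is not a library function) and returns `None` where R would give NA. It also tries the "duplicate the observations" idea from the question.

```python
import math

def pearson_r(x, y):
    """Pearson correlation; returns None if either input has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined (NA in R)
    return sxy / math.sqrt(sxx * syy)

gene_a = [10, 20]
gene_b = [20, 30]

r_two = pearson_r(gene_a, gene_b)                 # both go up
r_flipped = pearson_r(gene_a, [30, 20])           # one up, one down
r_constant = pearson_r(gene_a, [25, 25])          # no variation in one gene
r_duplicated = pearson_r(gene_a * 2, gene_b * 2)  # (10, 20, 10, 20) vs (20, 30, 20, 30)

print(r_two, r_flipped, r_constant, r_duplicated)
```

Two points always lie on a straight line, so the only possible results are 1, -1, or undefined. Repeating the same two points leaves r unchanged; it only makes the sample size look larger than it is, so any p-value computed from the duplicated data would be misleading.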

Also, please be cautious before you jump to conclusions: your discovery on multiple samples is possibly not something to be happy about. Your correlations might indicate systematic error, bias, or batch effects in the samples. I imagine the following: if your replicates are indeed repeated measurements of samples from the same condition, then any correlation indicates a (non-biological) bias. If the measurements were independent, one would in fact expect to see no correlation between repeated measurements; if you do see it, it means that the result depends on the order of the samples in the analysis.

A simple example is RNA-seq measurements that are uncorrected for library size: they will show a strong influence of library size on the read counts per gene, and therefore the raw read counts can be highly correlated. Your expression value example can, for example, easily be explained by the second measurement having twice as many reads as the first.
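A toy illustration of the library-size effect, with made-up numbers: four samples in which every second library is twice as deep, and two genes whose relative abundance is unrelated. The counts, library sizes, and the `pearson_r` helper are all hypothetical and chosen only to show the effect; scaling to counts per million is used here as one simple correction, not as a recommendation of a specific normalization method.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: every second sample was sequenced twice as deeply.
library_sizes = [1_000_000, 2_000_000, 1_000_000, 2_000_000]
raw_gene_a = [10, 20, 12, 24]
raw_gene_b = [20, 44, 22, 40]

def counts_per_million(counts, lib_sizes):
    """Scale raw counts by library size (counts per million reads)."""
    return [c * 1_000_000 / lib for c, lib in zip(counts, lib_sizes)]

cpm_gene_a = counts_per_million(raw_gene_a, library_sizes)
cpm_gene_b = counts_per_million(raw_gene_b, library_sizes)

r_raw = pearson_r(raw_gene_a, raw_gene_b)  # roughly 0.93, driven by sequencing depth
r_cpm = pearson_r(cpm_gene_a, cpm_gene_b)  # 0 once depth is accounted for

print(f"raw counts: r = {r_raw:.2f}")
print(f"per-million: r = {r_cpm:.2f}")
```

In this constructed case the strong correlation in the raw counts comes entirely from sequencing depth and disappears after scaling.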