Question

How to find correlation between two datasets/vectors?

0

Entering edit mode

5.8 years ago

John ▴ 270

Hi there,

I have two datasets. Matrix A of six columns and 300 rows. Matrix B of 12 columns and 300 rows. I want to find correlation between (each) one column of matrix A and all the columns of Matrix B.

Basically to show which column of Matrix B has similar expression pattern as Matrix A. Can anyone help me with R code?

Thanks.

RNA-Seq • 8.5k views

ADD COMMENT • link 5.8 years ago by John ▴ 270

score 4 · Answer 1 · 2019-04-21

4

Entering edit mode

5.8 years ago

Kevin Blighe 89k

To do the correlation, you can just follow this reproducible example using random data:

testmatrix1 <- matrix(rexp(1800, rate=.1), ncol=6)
colnames(testmatrix1) <- paste0("testmatrix1.Sample", 1:ncol(testmatrix1))

testmatrix2 <- matrix(rexp(3600, rate=.1), ncol=12)
colnames(testmatrix2) <- paste0("testmatrix2.Sample", 1:ncol(testmatrix2))

cor(testmatrix1, testmatrix2)

                    testmatrix2.Sample1 testmatrix2.Sample2 testmatrix2.Sample3
testmatrix1.Sample1          0.01296415       -0.0648876401        -0.026437428
testmatrix1.Sample2          0.02507722       -0.0466083100         0.051803640
testmatrix1.Sample3         -0.04509608        0.0885850839         0.016198746
testmatrix1.Sample4         -0.07253889        0.0109682660        -0.005428415
testmatrix1.Sample5          0.05309693        0.0005335641         0.025417901
testmatrix1.Sample6         -0.08103197       -0.0316691412         0.025865644
                    testmatrix2.Sample4 testmatrix2.Sample5 testmatrix2.Sample6
testmatrix1.Sample1        -0.044628657         -0.05145097        -0.084299004
testmatrix1.Sample2         0.091026535         -0.06536601         0.066175930
testmatrix1.Sample3         0.042544779         -0.02039660         0.111458092
testmatrix1.Sample4         0.040775626          0.04140741        -0.003768371
testmatrix1.Sample5        -0.026534599          0.05096871        -0.073291082
testmatrix1.Sample6        -0.008812823          0.08488437        -0.057106243
                    testmatrix2.Sample7 testmatrix2.Sample8 testmatrix2.Sample9
testmatrix1.Sample1        -0.053686505        -0.005479353        -0.024188906
testmatrix1.Sample2         0.151294463         0.030159951        -0.053079907
testmatrix1.Sample3         0.064047992        -0.045964736        -0.090923084
testmatrix1.Sample4        -0.149304622        -0.019311322        -0.063244353
testmatrix1.Sample5        -0.007202293        -0.039786009        -0.006401899
testmatrix1.Sample6        -0.082244307        -0.090395454        -0.027735417
                    testmatrix2.Sample10 testmatrix2.Sample11
testmatrix1.Sample1          -0.01897578          0.023409611
testmatrix1.Sample2           0.08483810          0.108088263
testmatrix1.Sample3          -0.01810386         -0.004953135
testmatrix1.Sample4           0.03711702          0.029325742
testmatrix1.Sample5           0.10329574         -0.026606608
testmatrix1.Sample6           0.02143732          0.044028624
                    testmatrix2.Sample12
testmatrix1.Sample1          -0.01421724
testmatrix1.Sample2          -0.03972874
testmatrix1.Sample3          -0.01106090
testmatrix1.Sample4           0.16243897
testmatrix1.Sample5          -0.04172540
testmatrix1.Sample6           0.04639565

By default, cor() uses Pearson correlation. To use Kendall or Spearman, take a look at the man page.

This does not specifically address the sentiment of "similar expression pattern", though.

Kevin

ADD COMMENT • link 5.8 years ago by Kevin Blighe 89k

0

Entering edit mode

thanks, how can I show whether it has "similar expression pattern"

ADD REPLY • link 5.8 years ago by John ▴ 270

0

Entering edit mode

Here is the limitation of correlation:

x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(10,11,15,17,26,77,190,200,267,300)

cor(x, y)
[1] 0.9337854

cor(x/1000, y*1000)
[1] 0.9337854

cor(x+50, y*1000000000000000000)
[1] 0.9337854

It does not take into account the magnitude of the underlying values.

To actually check expression patterns, I would merge the datasets and then do, for example:

PCA
hierarchical clustering (with heatmap) on the Z-transformed merged dataset (in R, just use scale(x) for col scaling or t(scale(t(x))) row scaling)

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

0

Entering edit mode

thanks for enlightening me!

but the problem is they are from different platform, one is single cell ,another is ribotag seq.

here's my explanation

enter link description here

ADD REPLY • link 5.8 years ago by John ▴ 270

1

Entering edit mode

So, you are aiming to do some sort of multi-omics analysis? In that case, the correlation may be one of the easiest options. You can also use the cor.test function in place of cor() in order to derive a p-value for the correlation.

Another option may be to perform a regression between the datasets based on the genes.

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

0

Entering edit mode

okay, Thanks a-lot!

How can I make sure whether the cor.test function exactly find correlation between two matrix with same order of genes? (as they are two different data-frames, I'm afraid if the order is wrong!)

my code is below, x is one matrix, y is another.

for (j in 1:length(x)) {
    for (i in 1:length(y)) { a <- cor.test(x[,j], y[,i], method = "spearman") 
    output[i,j]<- paste(a$estimate)
    }
    }

ADD REPLY • link 5.8 years ago by John ▴ 270

0

Entering edit mode

Hey, to create the matrix of p-values for the correlation test, it is indeed a bit more difficult and requires a loop of some sort. To build on the code that you have written, you just need to do:

create the random datasets again

testmatrix1 <- matrix(rexp(1800, rate=.1), ncol=6)
colnames(testmatrix1) <- paste0("testmatrix1.Sample", 1:ncol(testmatrix1))
testmatrix2 <- matrix(rexp(3600, rate=.1), ncol=12)
colnames(testmatrix2) <- paste0("testmatrix2.Sample", 1:ncol(testmatrix2))

create empty data-frame with rownames as the colnames of testmatrix1

output <- data.frame(row.names = colnames(testmatrix1))

double loop to calculate p-values

for (j in 1:ncol(testmatrix1)) {
  for (i in 1:ncol(testmatrix2)) {
    a <- cor.test(testmatrix1[,j], testmatrix2[,i], method = "spearman") 
    output[j,i] <- a$p.value
    colnames(output)[i] <- colnames(testmatrix2)[i]
  }
}

output
                    testmatrix2.Sample1 testmatrix2.Sample2 testmatrix2.Sample3
testmatrix1.Sample1           0.3518863          0.47606368          0.10102978
testmatrix1.Sample2           0.3961174          0.56423988          0.99568219
testmatrix1.Sample3           0.7408607          0.39592930          0.29648679
testmatrix1.Sample4           0.9291928          0.98327153          0.18853045
testmatrix1.Sample5           0.6518009          0.07268147          0.08976792
testmatrix1.Sample6           0.8339552          0.09669928          0.35602726
                    testmatrix2.Sample4 testmatrix2.Sample5 testmatrix2.Sample6
testmatrix1.Sample1           0.9092094           0.3305275          0.82292450
testmatrix1.Sample2           0.8855989           0.6614349          0.37536888
testmatrix1.Sample3           0.1092753           0.9206014          0.56385096
testmatrix1.Sample4           0.5927385           0.1078176          0.08340433
testmatrix1.Sample5           0.4327882           0.6604728          0.28091846
testmatrix1.Sample6           0.7108822           0.3686982          0.89042879
                    testmatrix2.Sample7 testmatrix2.Sample8 testmatrix2.Sample9
testmatrix1.Sample1          0.39665635           0.6124580           0.9679599
testmatrix1.Sample2          0.19092646           0.5742919           0.2318132
testmatrix1.Sample3          0.03735899           0.2470271           0.9994461
testmatrix1.Sample4          0.16147931           0.6147381           0.3380712
testmatrix1.Sample5          0.07368838           0.3759070           0.2812890
testmatrix1.Sample6          0.08264656           0.5687919           0.2017268
                    testmatrix2.Sample10 testmatrix2.Sample11
testmatrix1.Sample1           0.25252290            0.6361464
testmatrix1.Sample2           0.09881259            0.1207577
testmatrix1.Sample3           0.85420424            0.6877239
testmatrix1.Sample4           0.14445696            0.5322093
testmatrix1.Sample5           0.35637973            0.5900293
testmatrix1.Sample6           0.42454106            0.5896109
                    testmatrix2.Sample12
testmatrix1.Sample1            0.4367764
testmatrix1.Sample2            0.5291499
testmatrix1.Sample3            0.7566900
testmatrix1.Sample4            0.3309439
testmatrix1.Sample5            0.2748988
testmatrix1.Sample6            0.6713377

Now, check some values to ensure that it is correct:

cor.test(
  testmatrix1[,"testmatrix1.Sample4"],
  testmatrix2[,"testmatrix2.Sample7"],
  method = "spearman")$p.value
[1] 0.1614793

cor.test(
  testmatrix1[,"testmatrix1.Sample6"],
  testmatrix2[,"testmatrix2.Sample3"],
  method = "spearman")$p.value
[1] 0.3560273

cor.test(
  testmatrix1[,"testmatrix1.Sample1"],
  testmatrix2[,"testmatrix2.Sample11"],
  method = "spearman")$p.value
[1] 0.6361464

Note that, for a larger data-matrix, we could compute-parallelise this with the foreach function and the %dopar% operator

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks for the beautiful explanation, but my doubt is.. the row names(genes) of those two matrix is not in the same order.

My intention is to find whether the gene expression pattern is same or not. I used this, I'm not sure it whether it will stay in same order while using cor.test()

x<- x[order(rownames(x)),]; y<- y[order(rownames(y)),]

ADD REPLY • link 5.8 years ago by John ▴ 270

1

Entering edit mode

Both matrices have 300 genes, right? You want to ensure that the order of the genes is the same?

After you use the order() function, a simple check like this can be performed:

table(rownames(x) == rownames(y))

If you see just TRUE, then the gene names are the same and in the same order.

If you see a FALSE, then there is a discrepancy. You can use which() and match() to harmonise them.

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

0

Entering edit mode

That's TRUE, thanks....

ADD REPLY • link 5.8 years ago by John ▴ 270

0

Entering edit mode

I'm not sure why I'm getting Heatmap like this. This code is not working, only for this data.

z<- output; zz<- melt(as.matrix(z)) ggplot(data = zz, aes(x=Var2, y=Var1, fill=value)) + geom_tile()

heatmap

And also If I plot the matrix and transpose of matrix i.e t(x), it should give the same kind of heat map right? it's giving different one!!

here is correlation matrix

> head(output)
                    1                  2                  3                  4
la -0.237894537953645 -0.251502630862629 -0.264305063132588 -0.245031084597787
lb -0.215517768370364 -0.229088729495185 -0.242997143937622 -0.224441452675199
lc -0.231200458977979 -0.246982412009137 -0.261266470629486 -0.243472483950217
ld -0.242659217031868 -0.258090770669682 -0.272427581120237 -0.253981579777381
le -0.246959115443146 -0.262707629126362  -0.27741638335168  -0.26039059644323
lf -0.253397678701017 -0.269383847422869 -0.285295861526805  -0.26807596290528

ADD REPLY • link 5.8 years ago by John ▴ 270

0

Entering edit mode

The output that you get (in the figure) tells me that your numerical data is encoded as categorical, for whatever reason. Please check each step and use the function str() to see where this automatic conversion is taking place. Sometimes it may happen if there is a 'rogue' space or comma (or other non-numeric character) embedded in your numerical data somewhere.

Looking at your code, I suspect that as.matrix(z) is not doing what you think. Try data.matrix(), or just have melt(x)

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k