Hello Everyone,
I'm looking for some help comparing the same samples across different platforms. Initially I had microarray data for 7 samples. The same sample IDs were also used for RNA-seq.
So, for both microarray and RNA-seq I have the same 7 samples. But when I clustered the 7 RNA-seq samples, they behaved differently from the microarray samples. I worry that the sample IDs got mixed up in either the RNA-seq or the microarray data. I want to check the correlation between the samples across the two platforms.
Here I'm showing some data from both platforms.
Microarray data is shown below (Affymetrix SNP 6.0 data). This is how the expression data looks.
Genes Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7
Gene1 131.4311 369.4926 222.0441 687.4181 176.8892 258.1233 316.5573
Gene2 78.73022 83.97501 81.56039 86.11443 78.09758 81.88231 84.17101
Gene3 90.02816 95.07267 101.1761 93.35585 81.96468 94.3553 93.89527
Gene4 79.86837 81.63064 88.19524 79.47265 76.3437 101.6351 93.71674
Gene5 99.03493 109.1835 104.97 102.7423 108.3677 98.93459 101.4052
Gene6 79.58075 84.28915 90.53562 74.47786 75.96112 96.39649 95.8828
Gene7 121.5373 149.9351 146.5956 122.8523 110.5759 132.4268 130.4409
Gene8 616.5994 1326.735 1358.187 2315.851 1068.745 3229.759 4435.021
Gene9 70.44073 69.56772 68.25446 68.35857 70.86771 74.3843 67.93569
Gene10 78.62103 78.9498 76.96349 73.12749 76.37209 82.27274 82.54192
Gene11 69.45438 76.9461 80.96048 80.35287 78.84947 84.79259 88.00973
Gene12 84.79181 81.74586 81.13312 77.90322 79.47303 77.37466 75.77509
Gene13 86.88158 91.10217 90.85628 78.9453 76.18599 86.22007 83.63694
RNA-Seq data looks like below. I have the raw counts data from RNA-Seq.
Genes Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7
Gene1 470 1058 263 488 2027 1047 737
Gene2 20 89 14 87 76 169 21
Gene3 16 13 1 0 6 17 6
Gene4 0 0 0 0 0 0 0
Gene5 0 0 0 5 0 0 1
Gene6 2 3 4 30 23 9 23
Gene7 20 0 48 113 92 14 88
Gene8 0 0 0 10 0 3 2
Gene9 40 10 154 18 401 74 318
Gene10 85 51 333 40 632 165 667
Gene11 1897 8879 1725 2645 8519 3823 1983
Gene12 203 593 361 380 524 436 227
Gene13 46 267 117 207 351 240 70
I'm very new to R and to this type of correlation analysis between samples. Can anyone tell me how I should do this and which functions I have to use? It would be very helpful if you could show me with the above data.
Any help is appreciated. Thanks a lot.
The simplest thing is to plot one platform against the other for each sample (with the genes in the same order in both tables):
plot(madata$Sample1, rnadata$Sample1)
Do I need to do that for each sample separately? What if I have 50 samples in each platform?
Yes, if you have 50 samples, you want to calculate all pairwise correlations between the microarray samples and the RNA-seq samples. The corrplot package may be helpful for visualizing them.
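To avoid plotting every pair by hand, here is a minimal sketch of the all-against-all comparison. It relies on the fact that cor() accepts two matrices and correlates every column of the first with every column of the second. The data are simulated (not the values from the question), the names truth, cross_cor and best_match are made up for illustration, and it assumes both tables are already restricted to the same genes in the same row order. Two RNA-seq labels are swapped on purpose to show how a mix-up would appear.

```r
set.seed(1)
genes   <- paste0("Gene", 1:200)
samples <- paste0("Sample", 1:7)

# Simulated "true" expression shared by both platforms (illustrative only)
truth <- matrix(rnorm(200 * 7, mean = 8, sd = 2), nrow = 200,
                dimnames = list(genes, samples))

# Each platform measures the same samples with its own noise
madata  <- truth + rnorm(length(truth), sd = 0.5)
rnadata <- truth + rnorm(length(truth), sd = 0.5)

# Deliberately swap two RNA-seq labels to mimic a sample mix-up
colnames(rnadata)[1:2] <- c("Sample2", "Sample1")

# Rows of the result: microarray samples; columns: RNA-seq samples.
# Spearman is rank-based, so it tolerates the different scales of the platforms.
cross_cor <- cor(madata, rnadata, method = "spearman")

# For each microarray sample, the RNA-seq sample with the highest correlation
best_match <- colnames(cross_cor)[apply(cross_cor, 1, which.max)]
names(best_match) <- rownames(cross_cor)
print(round(cross_cor, 2))
print(best_match)
```

If the labels agree, each microarray sample should correlate best with the RNA-seq sample of the same name; here Sample1 and Sample2 point at each other because of the deliberate swap. A quick visual check is heatmap(cross_cor, Rowv = NA, Colv = NA, scale = "none"). With real data the differences between matching and non-matching pairs are usually much smaller than in this simulation.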
Which RNA-seq values did you use for the clustering of the RNA-seq data? I would probably not use the raw counts for the correlation analysis, because raw counts also depend on the sequencing depth of each individual sample, among other things. You probably want to do that on rlog or cpm values or, at the bare minimum, on the results of DESeq2::counts(dds, normalized = TRUE). For more details on DESeq2, see Michael Love's tutorial.
Yes, I will convert the counts to normalized counts and use those. Do I need to take the same number of genes in both platforms for the correlation?
If a gene is missing from one assay, you cannot compare its expression values between the two methods. Unless you had another way of calculating the correlation in mind, my answer would be: yes, you absolutely need to look at the same genes in both assays.
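Putting the two points together (normalize the counts, then keep only the shared genes), a rough sketch in base R could look like the following. The tables hold a small subset of the values posted in the question (Samples 1 to 3 only), with Gene8 left out of the RNA-seq table and Gene11 out of the microarray table purely to illustrate the matching step. The log2(CPM + 1) transform is a simple stand-in for the rlog/cpm values suggested above, not the DESeq2 method itself, and the object names (ma, rna_counts, log_cpm, shared) are hypothetical.

```r
samples <- c("Sample1", "Sample2", "Sample3")

# Microarray intensities (genes x samples)
ma <- matrix(c(131.4311, 369.4926, 222.0441,
               78.73022, 83.97501, 81.56039,
               616.5994, 1326.735, 1358.187,
               70.44073, 69.56772, 68.25446),
             nrow = 4, byrow = TRUE,
             dimnames = list(c("Gene1", "Gene2", "Gene8", "Gene9"), samples))

# RNA-seq raw counts (genes x samples)
rna_counts <- matrix(c(470, 1058, 263,
                       20,   89,  14,
                       40,   10, 154,
                       1897, 8879, 1725),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(c("Gene1", "Gene2", "Gene9", "Gene11"), samples))

# Counts per million: divide each column by its library size.
# Library sizes are computed on all genes, before any subsetting.
lib_size <- colSums(rna_counts)
cpm      <- sweep(rna_counts, 2, lib_size, "/") * 1e6
log_cpm  <- log2(cpm + 1)

# Keep only genes measured on both platforms, in the same order
shared     <- intersect(rownames(ma), rownames(log_cpm))
ma_shared  <- ma[shared, , drop = FALSE]
rna_shared <- log_cpm[shared, , drop = FALSE]

# Microarray samples in rows, RNA-seq samples in columns
cross_cor <- cor(ma_shared, rna_shared, method = "spearman")
print(shared)
print(cross_cor)
```

Three shared genes are far too few for a meaningful correlation; with the real tables you would run the same steps on all genes present in both. Because Spearman correlation only uses ranks, the microarray values do not need to be log-transformed for this comparison.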