How to make a scatter plot showing correlation between genes?
6.4 years ago
Vasu ▴ 790

Hi,

I have a matrix with samples as rows and genes as columns, containing gene expression values (RPKM).

Below is some example data. The original data has more than 800 samples.

        LINP1  EGFR         RB1          TP53         CDKN2A       MYC
Sample1 0.02   0.038798682  0.1423662    2.778587067  0.471403939  18.93687655
Sample2 0      0.059227225  0.208765213  0.818810739  0.353671882  1.379027685
Sample3 0      0.052116384  0.230437735  2.535040249  0.504061015  9.773089223
Sample4 0.06   0.199264618  0.261100548  2.516963635  0.63659138   11.01441624
Sample5 0      0.123521916  0.273330986  2.751309388  0.623572499  34.0563519
Sample6 0      0.128767634  0.263491811  2.882878373  0.359322715  13.02402045
Sample7 0      0.080097356  0.234511372  3.568192768  0.386217698  9.068928569
Sample8 0      0.017421323  0.247775683  5.109428797  0.068760572  15.7490551
Sample9 0      2.10281137   0.401582013  8.202902242  0.140596724  60.25989178

To make a scatter plot showing the correlation between two genes, I used ggscatter:

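library(ggpubr)  # ggscatter() comes from the ggpubr package

# A2 is the expression data frame (samples as rows, genes as columns)
ggscatter(A2, x = "LINP1", y = "RB1",
          add = "reg.line", conf.int = FALSE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "LINP1", ylab = "RB1")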

The resulting scatter plot looks like this: LINP1 and RB1. Why are the points not following the direction of the regression line? Do I need to change the scales of the x-axis and y-axis?

I also want to make a scatter plot like Fig 2g in this research paper, where LINP1 expression is shown against all the other genes in a single plot. Is this possible with code?

RNA-Seq scatterplot gene rpkm r • 4.6k views

Use a log scale, or log2-transform your data.
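
One thing to watch with the transform is that most LINP1 values in the example data are exactly 0, and log2(0) is -Inf. A minimal sketch of one common workaround, assuming a pseudocount of 1 is acceptable for these RPKM values (the pseudocount and the name A2_log are assumptions for illustration, not from the question):

```r
# Two genes from the example data in the question
A2 <- data.frame(
  LINP1 = c(0.02, 0, 0, 0.06, 0, 0, 0, 0, 0),
  RB1   = c(0.1423662, 0.208765213, 0.230437735, 0.261100548, 0.273330986,
            0.263491811, 0.234511372, 0.247775683, 0.401582013),
  row.names = paste0("Sample", 1:9)
)

# log2(0) is -Inf, so add a pseudocount before transforming.
# The value 1 is an assumption; pick one that suits your data.
A2_log <- log2(A2 + 1)

head(A2_log)
```

A2_log can then be passed to the same ggscatter call in place of A2.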


Oh yes, thanks. I just noticed that ggscatter has the arguments xscale and yscale, so I am now passing xscale = "log2", yscale = "log2". Any idea about the plot shown in the paper?


The plot from the paper can be made by calling par(mfrow = c(1, 8)) before you make the 8 plots. Make the plots without axes, then add the axes afterwards. See ?plot for more details.
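
The approach described above could look roughly like the following base R sketch. It uses the example data from the question, so there are 5 panels instead of 8. Putting LINP1 on a shared y-axis, the log2(x + 1) transform, and the regression line and r value in each panel are assumptions for illustration; they are not taken from the paper's figure.

```r
# Example data from the question
A2 <- read.table(header = TRUE, row.names = 1, text = "
Sample  LINP1 EGFR        RB1         TP53        CDKN2A      MYC
Sample1 0.02  0.038798682 0.1423662   2.778587067 0.471403939 18.93687655
Sample2 0     0.059227225 0.208765213 0.818810739 0.353671882 1.379027685
Sample3 0     0.052116384 0.230437735 2.535040249 0.504061015 9.773089223
Sample4 0.06  0.199264618 0.261100548 2.516963635 0.63659138  11.01441624
Sample5 0     0.123521916 0.273330986 2.751309388 0.623572499 34.0563519
Sample6 0     0.128767634 0.263491811 2.882878373 0.359322715 13.02402045
Sample7 0     0.080097356 0.234511372 3.568192768 0.386217698 9.068928569
Sample8 0     0.017421323 0.247775683 5.109428797 0.068760572 15.7490551
Sample9 0     2.10281137  0.401582013 8.202902242 0.140596724 60.25989178
")

others <- setdiff(colnames(A2), "LINP1")   # every gene except LINP1
y <- log2(A2$LINP1 + 1)                    # shared y values (pseudocount assumed)

# One row of panels, one per gene; outer left margin holds the shared y label
op <- par(mfrow = c(1, length(others)), mar = c(4, 1, 2, 1), oma = c(0, 4, 0, 0))

for (g in others) {
  x <- log2(A2[[g]] + 1)
  plot(x, y, axes = FALSE, xlab = g, ylab = "", pch = 16)  # draw without axes
  box()
  axis(1)                                  # x-axis on every panel
  if (g == others[1]) axis(2)              # y-axis on the leftmost panel only
  abline(lm(y ~ x), col = "red")           # per-panel regression line
  r <- cor(x, y, method = "pearson")
  title(main = sprintf("r = %.2f", r), cex.main = 0.9)
}

mtext("LINP1, log2(RPKM + 1)", side = 2, outer = TRUE, line = 2.5)
par(op)                                    # restore previous graphics settings
```

With many panels, the plotting device needs to be wide enough; otherwise R may stop with a "figure margins too large" error.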


The issue in the plot that you've shown is with the distribution of data-points for your LINP1 gene, i.e., the vast majority of points are either zero or close to it. The gene's distribution is heavily skewed, and this, in turn, 'confuses' whatever statistical methodology you apply to this data.

Thus, your Pearson correlation values are misleading and not valid.

You may consider log-transforming the data (Edit: as also noted by my colleague b.nota).
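
To see how skewed a gene is before trusting a correlation coefficient, it can help to check the fraction of zero values and compare the coefficient before and after transformation. A minimal sketch on the example data from the question; the variable names and the pseudocount of 1 are assumptions for illustration:

```r
# LINP1 and RB1 values from the example data in the question
LINP1 <- c(0.02, 0, 0, 0.06, 0, 0, 0, 0, 0)
RB1   <- c(0.1423662, 0.208765213, 0.230437735, 0.261100548, 0.273330986,
           0.263491811, 0.234511372, 0.247775683, 0.401582013)

# Fraction of samples in which LINP1 is exactly zero
zero_frac <- mean(LINP1 == 0)

# Pearson correlation on the raw values and on log2(x + 1) values
# (the pseudocount of 1 is an assumption)
r_raw <- cor(LINP1, RB1, method = "pearson")
r_log <- cor(log2(LINP1 + 1), log2(RB1 + 1), method = "pearson")

print(c(zero_frac = zero_frac, r_raw = r_raw, r_log = r_log))
```

If zero_frac is large, the coefficient rests on only a few non-zero points and should be read with caution, whichever scale is used.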


Yes, thanks. I saw that just now. Any idea about the plot in the paper?
