Question

Is logCPM expression data good for plottinng?

0

Entering edit mode

6.1 years ago

Biologist ▴ 290

I have downloaded cell-line RNA-Seq readcounts data and performed differential analysis between groups A and B. Among the differential expressed genes I'm interested in 4 genes which are upregulated in group A. So, I extracted the counts data for that four genes which looks like below.

    AU565_BREAST    BT474_BREAST    BT483_BREAST    BT549_BREAST    CAL120_BREAST   CAL51_BREAST    CAL851_BREAST   CAMA1_BREAST    HCC1143_BREAST  HCC1419_BREAST  HCC1428_BREAST  HCC1500_BREAST  HCC1599_BREAST  HCC1806_BREAST  HCC1937_BREAST  HCC1954_BREAST  HCC202_BREAST   HCC2157_BREAST  HCC2218_BREAST  HCC38_BREAST    HCC70_BREAST    HDQP1_BREAST    HS578T_BREAST   MCF7_BREAST MDAMB134VI_BREAST   MDAMB157_BREAST MDAMB175VII_BREAST  MDAMB231_BREAST MDAMB361_BREAST MDAMB415_BREAST MDAMB436_BREAST MDAMB468_BREAST SKBR3_BREAST    T47D_BREAST UACC812_BREAST  UACC893_BREAST  ZR751_BREAST    ZR7530_BREAST
Gene1   4   10  28  53  9   5044    9   9   1060    3   2   2   1522    1247    10  28  1   0   5   826 15  3124    111 2   2   158 213 13  0   0   190 7   3   5   9   3   3   4
Gene2   64  270 96  309 873 1392    6649    49  9970    43  90  605 7956    3770    2134    4291    294 975 75  7544    6220    3776    256 122 43  8758    608 1951    244 3918    520 2170    208 391 116 127 281 91
Gene3   0   0   6   50  2004    3   76  3   4586    0   1   0   20  16  788 347 1   5661    0   31  287 437 2   2   2   0   4   107 165 7   183 294 1   0   8   6   0   3
Gene4   2   2   0   989 16937   532 35  3   1   0   0   0   1   205 2841    13  4   2   0   198 94  3   89  1   2   139 0   896 111 64  73  45  0   0   5   1   4   6

I wanted to plot the expression data of the genes. So, first I did some google search and found that logCPM from counts is good for visualization and also this expression between tumor and normal plot

I converted the counts to logCPM.

logCPM <- (counts, prior.count = 2, log = TRUE)

    AU565_BREAST    BT474_BREAST    BT483_BREAST    BT549_BREAST    CAL120_BREAST   CAL51_BREAST    CAL851_BREAST   CAMA1_BREAST    HCC1143_BREAST  HCC1419_BREAST  HCC1428_BREAST  HCC1500_BREAST  HCC1599_BREAST  HCC1806_BREAST  HCC1937_BREAST  HCC1954_BREAST  HCC202_BREAST   HCC2157_BREAST  HCC2218_BREAST  HCC38_BREAST    HCC70_BREAST    HDQP1_BREAST    HS578T_BREAST   MCF7_BREAST MDAMB134VI_BREAST   MDAMB157_BREAST MDAMB175VII_BREAST  MDAMB231_BREAST MDAMB361_BREAST MDAMB415_BREAST MDAMB436_BREAST MDAMB468_BREAST SKBR3_BREAST    T47D_BREAST UACC812_BREAST  UACC893_BREAST  ZR751_BREAST    ZR7530_BREAST
Gene1   15.81534632 15.13595373 17.71879101 15.22778321 10.02265056 19.46424804 10.90293967 17.10581343 16.06132528 16.00421128 14.42959371 11.92100226 17.29334676 17.86287538 11.17797231 12.6803334  11.93521678 9.196524783 15.94337055 16.5587149  11.47739858 18.69947881 17.8885792  13.99402373 15.33577937 14.13692437 17.97961262 12.27702792 9.196524783 9.196524783 17.58815309 11.71666849 13.84559361 13.68804974 16.00421128 14.45499535 13.42405629 15.25130209
Gene2   19.80151825 19.86802375 19.49361648 17.75293367 15.44394046 17.60990597 19.90493181 19.54569096 19.28374957 19.83348375 19.88344438 19.92596329 19.67515375 19.45660009 18.49640436 19.80591212 19.9015928  17.16835827 19.83766892 19.74200108 19.84173113 18.97259975 19.09218622 19.87280923 19.74239495 19.8826374  19.49070132 19.32637171 18.8400491  19.90482749 19.03793654 19.71742002 19.90325742 19.91240087 19.68033902 19.82144116 19.89524428 19.73819778
Gene3   9.196524783 9.196524783 15.51071491 15.14504187 16.63201224 9.989490832 13.52664583 15.53280377 18.16494858 9.196524783 13.46745133 9.196524783 11.39324708 11.8288248  17.06301619 16.18806413 11.93521678 19.70117632 9.196524783 12.03189366 15.42244264 15.87396409 12.27274833 13.99402373 15.33577937 9.196524783 12.40660735 15.15986204 18.27649735 11.19197535 17.53416163 16.83986496 12.37127695 9.196524783 15.83589516 15.43602658 9.196524783 14.8434806
Gene4   14.82995162 12.90512275 9.196524783 19.42866012 19.70386922 16.22906914 12.48966785 15.53280377 9.345795955 9.196524783 9.196524783 9.196524783 9.434355615 15.27604692 18.90867329 11.71503095 13.7632545  9.793864295 9.196524783 14.52562066 13.85116605 9.958492229 17.57075876 13.04499079 15.33577937 13.95849295 9.196524783 18.20524388 17.7058873  14.0199393  16.21499069 14.171432   9.196524783 9.196524783 15.16648035 12.9435081  13.81971106 15.82901231

Surprisingly, I see that all the values of Gene2 are almost close. For Gene3 in counts data you will see the 4th value 50 counts and in logCPM you see 15.14504187, but for the same Gene3 in counts data you will see the 3rd value 6 counts and in logCPM you see 15.51071491. Looks like something wrong in this?

I actually wanted to plot the logCPM data for each gene between the two groups something like in a boxplot.

First I transposed the logCPM.

t_logCPM <- t(logCPM)

To t_logCPM I have added the groups column and plotted the data for Gene2 in boxplot.

ggboxplot(t_logCPM, x = "Type", y = "Gene2", 
               color="Type", palette = c("#FC4E07", "#FC4E07"),
               order = c("A", "B"), add = "jitter", shape="Type",
               ylab = "Gene2 (logCPM)")

Boxplot looks like this How come the expression of Gene2 is highly expressed in group B in the plot? It should be highly expressed in group A.

Any help is appreciated. thanq

RNA-Seq R geneexpression plotting edger • 3.9k views

ADD COMMENT • link 6.1 years ago by Biologist ▴ 290

1

Entering edit mode

could you check you get the same results when you compute logCPM on the whole of your counts matrix prior to extracting this subset of the genes

ADD REPLY • link 6.1 years ago by russhh 5.7k

0

Entering edit mode

When I did the whole matrix from counts to logCPM and then checked the expression of Gene2. For counts less than 100 it gave negative values and expression values looks completely different from what I mentioned in the question

ADD REPLY • link 6.1 years ago by Biologist ▴ 290

1

Entering edit mode

IMO you should normalise on the whole dataset not a subset of genes.

ADD REPLY • link 6.1 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Yes, you are right. I calculated logCPM for the whole dataset and then extracted the required genes logCPM values. This looks fine now for Gene2 Gene2 (logCPM) boxplot. But for Gene4 when I plotted the logCPM data I could also see negative value on y-axis. Is that ok or something wrong in that Gene4 (logCPM)

ADD REPLY • link 6.1 years ago by Biologist ▴ 290

1

Entering edit mode

nothing wrong with a negative value if it was generated by log-transforming a positive value. The CPM value must have been less than one.

ADD REPLY • link 6.1 years ago by russhh 5.7k

0

Entering edit mode

Thank a lot for the answer.

ADD REPLY • link 6.1 years ago by Biologist ▴ 290

1

Entering edit mode

if CPM < 1 then log10(CPM) will be < 0. Example log10(0.1) = -1

ADD REPLY • link 6.1 years ago by Nicolas Rosewick 11k