Question

Finding a gene signature among multiple group

2

Entering edit mode

4.2 years ago

Raheleh ▴ 260

Hi all, I have a expression profile (RNAseq) of 580 immune-related genes for 350 samples that are already classified into 4 subtypes. I want to find the genes (out 0f 580) that significantly have different expression among these 4 subtypes. In an article Significance analysis of microarrays (SAM) method was used to identify differential expression gene between multiple groups. However my data is RNAseq and I couldn't figure out how to use SAM for my data?

Is there any other ways or packages in r that can do this? Really appreciate any suggestion!

SAM RNAseq signature • 1.9k views

ADD COMMENT • link updated 2.0 years ago by 1769mkc ★ 1.2k • written 4.2 years ago by Raheleh ▴ 260

0

Entering edit mode

As @Kevin said here I used ANOVA to find genes which are significantly expressed across my four subtypes. To apply ANOVA on all my 580 genes I used this function:

AOV <- function(Gene){
      y <- df[,c(Gene, "type")]
      colnames(y)[1]="Gene"
      anova(aov(Gene~type, y))[1,5]
}

res.aov <- sapply(genes, AOV)
res.aov_sig <- subset(res.aov, res.aov<0.05)

Then I filtered based on pval <0.05, now I have 514 genes that is still quite high and don't give a good heatmap that shows discrimination between 4 groups. I just want to keep those genes which are expressing significantly in each group but not in 3 others. Any help or suggestion? Many thanks in advance!

ADD REPLY • link 4.2 years ago by Raheleh ▴ 260

score 4 · Answer 1 · 2020-09-22

4

Entering edit mode

4.2 years ago

Kevin Blighe 88k

A multinomial penalised regression could work here, but I am not sure of your level of experience (see glmnet's vignette). ANOVA should also help. The issue is that you have too many genes from ANOVA? - why not reduce the p-value threshold to, e.g., p<0.001? This would keep it very simple.

If the issue instead is that you want to use Significance analysis of microarrays (SAM), then you can still use this with RNA-seq data. You should normalise your RNA-seq counts and then transform these so that the distribution of your final data is normal. In EdgeR, this would be normalisation followed by cpm(dge, log = TRUE, prior.count=3), while, in DESeq2, it would be normalisation followed by rlog() or vst().

Always know your data's distribution by plotting it as a histogram (hist())

Kevin

ADD COMMENT • link 4.2 years ago by Kevin Blighe 88k

0

Entering edit mode

Thanks @Kevin, after applying anova pval < 0.001 I got 497 genes. Then to do multinomial penalised regression I followed your comprehensive reply in this post and did every step as you said and now after applying this chunk of the script:

g1 <- rownamesco.se$S1)[whichco.se$S1 != 0)]
g2 <- rownamesco.se$S2)[whichco.se$S2 != 0)]
g3 <- rownamesco.se$S3)[whichco.se$S3 != 0)]
g4 <- rownamesco.se$S4)[whichco.se$S4 != 0)]

I have 121 genes as best predictors (g1=30, g2=30, g3=34, g4=27). Now for downstream applications here: (Can I build my predictive model by all 121 genes or still needs more filtering?)

finalLasso <- glm(modellingLasso$Type ~ all 121 gene,
  data=modellingLasso,
  family=binomial(link="logit"))

Many thanks for any help and suggestion!

ADD REPLY • link 4.2 years ago by Raheleh ▴ 260

0

Entering edit mode

Hi @Kevin I'll be very thankful if you advice me on this result.

all my 475 genes that I included in this analysis have non-zero coefficient.

Call:  glmnet(x = data.matrix(modellingLasso[, -1]), y = modellingLasso$type,      family = "multinomial", alpha = 0, standardize = TRUE) 

     Df  %Dev Lambda
1   475  0.00 346.00
2   475  4.92 330.30
3   475  5.13 315.30
4   475  5.34 301.00
5   475  5.56 287.30
6   475  5.78 274.20
7   475  6.01 261.80
8   475  6.25 249.90
9   475  6.50 238.50
10  475  6.76 227.70
11  475  7.02 217.30
12  475  7.29 207.40
13  475  7.57 198.00
14  475  7.86 189.00
15  475  8.16 180.40
16  475  8.47 172.20
17  475  8.78 164.40
18  475  9.11 156.90
19  475  9.44 149.80
20  475  9.79 143.00
21  475 10.14 136.50
22  475 10.49 130.30
23  475 10.84 124.40
24  475 11.22 118.70
25  475 11.61 113.30
26  475 12.00 108.20
27  475 12.41 103.20
28  475 12.83  98.55
29  475 13.26  94.07
30  475 13.69  89.80
31  475 14.14  85.72
32  475 14.59  81.82
33  475 15.06  78.10
34  475 15.53  74.55
35  475 16.02  71.16
36  475 16.51  67.93
37  475 17.02  64.84
38  475 17.53  61.89
39  475 18.05  59.08
40  475 18.59  56.40
41  475 19.13  53.83
42  475 19.68  51.39
43  475 20.24  49.05
44  475 20.81  46.82
45  475 21.39  44.69
46  475 21.98  42.66
47  475 22.58  40.72
48  475 23.19  38.87
49  475 23.80  37.10
50  475 24.43  35.42
51  475 25.06  33.81
52  475 25.70  32.27
53  475 26.35  30.80
54  475 27.01  29.40
55  475 27.68  28.07
56  475 28.35  26.79
57  475 29.03  25.57
58  475 29.72  24.41
59  475 30.41  23.30
60  475 31.11  22.24
61  475 31.82  21.23
62  475 32.53  20.27
63  475 33.25  19.35
64  475 33.97  18.47
65  475 34.69  17.63
66  475 35.43  16.83
67  475 36.16  16.06
68  475 36.90  15.33
69  475 37.64  14.63
70  475 38.39  13.97
71  475 39.13  13.33
72  475 39.88  12.73
73  475 40.63  12.15
74  475 41.39  11.60
75  475 42.14  11.07
76  475 42.89  10.57
77  475 43.65  10.09
78  475 44.40   9.63
79  475 45.15   9.19
80  475 45.91   8.77
81  475 46.66   8.38
82  475 47.41   7.99
83  475 48.15   7.63
84  475 48.90   7.28
85  475 49.64   6.95
86  475 50.38   6.64
87  475 51.12   6.34
88  475 51.85   6.05
89  475 52.58   5.77
90  475 53.31   5.51
91  475 54.03   5.26
92  475 54.75   5.02
93  475 55.46   4.79
94  475 56.17   4.57
95  475 56.87   4.37
96  475 57.57   4.17
97  475 58.26   3.98
98  475 58.94   3.80
99  475 59.63   3.62
100 475 60.30   3.46

and the image enter image description here

How I can interpret this? but after performing 10-Fold CV, 121 genes returns as non-zero.

Many thanks!

ADD REPLY • link 4.2 years ago by Raheleh ▴ 260

1

Entering edit mode

Any reason why your alpha = 0, i.e., you are doing a ridge regression? If you increase alpha to alpha = 1, more variables will be shrunk to 0 as the penalty is then the lasso penalty, i.e., lasso-penalised regression.

I would always use cv.glmnet, by the way - this is performing a cross-validation.

Have you read this? - https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

ADD REPLY • link 4.2 years ago by Kevin Blighe 88k

0

Entering edit mode

Many thanks @Kevin for being always helpful! Ah, it was a mistake I change it to alpha =1. Yes, I read that and used cv.glmnet. Then I built a new model with 47 genes/predictors that I got with rownamesco.se$CMS1)[whichco.se$type > 0)]

finalLasso <- glm(as.factor(modellingLasso$type) ~ BCL6 + BST2 + CLEC5A + CXCR4+ DPP4 + DUSP4 + GNLY+ JAK2 +KIR3DL1+LCK + MAPK11 + MBP +    
                    PLA2G2A+SPP1 +TAP2  + BID+ CD3EAP+ CXCL2+ IL12A+ MAPK14+NOD2 +TCF7 +TRAF5 + BCL3+BLNK +GFI1+ IL1A+IL22 + ITLN1+KIT + 
                    NOS2 + PTGER4 +TLR4 + TNFRSF11A+BST1 + C1S + C7 +CD81 + ENTPD1 +FCGRT+ FN1+ MARCO +MME + PDGFRB +STAT5B+ THY1 +ZEB1,
                  data=modellingLasso[,-1],
                  family=binomial(link="logit"))

summary(finalLasso)
Call:
glm(formula = as.factor(modellingLasso$type) ~ BCL6 + BST2 + 
    CLEC5A + CXCR4 + DPP4 + DUSP4 + GNLY + JAK2 + KIR3DL1 + LCK + 
    MAPK11 + MBP + PLA2G2A + SPP1 + TAP2 + BID + CD3EAP + CXCL2 + 
    IL12A + MAPK14 + NOD2 + TCF7 + TRAF5 + BCL3 + BLNK + GFI1 + 
    IL1A + IL22 + ITLN1 + KIT + NOS2 + PTGER4 + TLR4 + TNFRSF11A + 
    BST1 + C1S + C7 + CD81 + ENTPD1 + FCGRT + FN1 + MARCO + MME + 
    PDGFRB + STAT5B + THY1 + ZEB1, family = binomial(link = "logit"), 
    data = modellingLasso[, -1])

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-6.159e-05   2.100e-08   2.100e-08   2.100e-08   7.823e-05  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.402e+02  3.341e+06   0.000    1.000
BCL6        -6.658e-01  1.181e+05   0.000    1.000
BST2        -1.348e+01  3.700e+04   0.000    1.000
CLEC5A      -2.915e+00  6.786e+04   0.000    1.000
CXCR4       -2.455e+01  5.529e+04   0.000    1.000
DPP4        -2.338e+01  2.216e+04  -0.001    0.999
DUSP4       -1.176e+01  3.482e+04   0.000    1.000
GNLY         5.055e+00  2.848e+04   0.000    1.000
JAK2        -1.139e+01  1.050e+05   0.000    1.000
KIR3DL1     -9.865e+00  5.458e+04   0.000    1.000
LCK         -2.391e+00  2.890e+04   0.000    1.000
MAPK11       1.094e+01  7.569e+04   0.000    1.000
MBP          7.160e+00  8.727e+04   0.000    1.000
PLA2G2A     -6.152e+00  1.614e+04   0.000    1.000
SPP1         2.827e+00  2.703e+04   0.000    1.000
TAP2        -1.597e+01  3.359e+04   0.000    1.000
BID          9.441e+00  1.599e+05   0.000    1.000
CD3EAP      -3.009e+01  9.002e+04   0.000    1.000
CXCL2        4.222e+00  6.082e+04   0.000    1.000
IL12A       -6.076e+00  2.244e+04   0.000    1.000
MAPK14       4.734e+01  8.924e+04   0.001    1.000
NOD2         3.305e+00  6.419e+04   0.000    1.000
TCF7        -1.026e+00  6.888e+04   0.000    1.000
TRAF5        1.636e+00  6.563e+04   0.000    1.000
BCL3         2.183e+01  6.592e+04   0.000    1.000
BLNK         1.492e+01  3.648e+04   0.000    1.000
GFI1        -2.207e+01  4.799e+04   0.000    1.000
IL1A         9.129e+00  5.346e+04   0.000    1.000
IL22        -1.122e+01  1.006e+05   0.000    1.000
ITLN1        9.118e+00  1.332e+04   0.001    0.999
KIT         -1.484e+01  7.276e+04   0.000    1.000
NOS2         5.121e+00  3.340e+04   0.000    1.000
PTGER4      -1.245e+01  5.688e+04   0.000    1.000
TLR4         1.763e+01  4.466e+04   0.000    1.000
TNFRSF11A   -2.218e+01  1.872e+04  -0.001    0.999


(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3.7708e+02  on 444  degrees of freedom
Residual deviance: 7.5710e-08  on 397  degrees of freedom
AIC: 96

Number of Fisher Scoring iterations: 25

Then to do multi-class ROC analysis I used this function:

library(pROC)
roc.multi <- multiclass.roc(modellingLasso$type, fitted(finalLasso))

> roc.multi

Call:
multiclass.roc.default(response = modellingLasso$type, predictor = fitted(finalLasso))

Data: fitted(finalLasso) with 4 levels of modellingLasso$type: S1, S2, S3, S4.
Multi-class area under the curve: 0.7392

I had a problem to visualize the multi-class ROC curve, however with help of this post I could visualize it but I'm struggling how to interpret it and modify it with the legend for each corresponding color/subtype? enter image description here

any help? Many thanks!!

ADD REPLY • link 4.2 years ago by Raheleh ▴ 260

0

Entering edit mode

your all pvalue is almost 1 may be you should look for the expression

ADD REPLY • link 2.0 years ago by 1769mkc ★ 1.2k