Differential Gene Expression Analysis using data_RNA_Seq_v2_expression_median RSEM.Normalized
1
1
Entering edit mode
7.2 years ago
gokce.ouz ▴ 70

Hi All,

We have a specific gene mutation and we would like to learn how it is effective on Breast cancer.

So using the R, I get the mutation information from sequenced cases of TCGA Provisional and then stratified patients into two categories as Mutated & Wild Type. I downloaded the mRNA Expression z-Scores (RNA Seq V2 RSEM) from the cBioPortal website. I would like to look at the differentially expressed gene between these two groups but I have several questions :

  1. The RNA seq data is Rsem.normalized, before I do any further analysis I transformed them into log2(rsem+1), that is correct right ?

  2. For differential gene expression analysis what do you suggest me to use ? I cannot use DeSEQ2 or edgeR as they require raw counts as input.

  3. I used limma package but I guess I get shows my data has some problem . Does it look ok or should I do something else ?  Voom & Final model


library(edgeR)
library(limma)
group = c( rep("Mut", 191), rep("WT", 660))
design <- model.matrix(~ 0 + group)
colnames(design) <- c("Mut", "WT")
y = TCGA_comb
par(mfrow=c(1,2))
v <- voom(y,design,plot = TRUE)
fit <- lmFit(v, design)
cont.matrix <- makeContrasts(PIK3CA_mutVSwt=Mut - WT,levels=design)
fit.cont <- contrasts.fit(fit, cont.matrix)
fit.cont <- eBayes(fit.cont)
plotSA(fit.cont)
summa.fit <- decideTests(fit.cont)
tab <- topTable(fit.cont, n=Inf, coef="PIK3CA_mutVSwt")
  1. Would it be too superficial if I calculate Fold Change, p-value & FDR on my own?

     a) Fold change: Take average of each gene per group and then Log2(B)-Log2(A)
    
     b) p-value: t.test command of R
    
     c) FDR: p.adjust(pvalue,method="fdr")
    

Many many thanks,

Gokce

RNA-Seq limma R DGE • 4.9k views
ADD COMMENT
3
Entering edit mode
7.2 years ago

Hi Gokce

Answers below

1) Yes you can you the log-transformation as you describe - but you could also use a smaller pseudo-count (say 0.01) punishing smaller changes less.

2) Raw count data from TCGA's RNA_Seq_V2 are available via the CDC data portal or a most of other APIs such as TCGAbiolinks

3) Don't do that - limma uses normality as an assumption (just like a t.test) - so if you dont trust the limma results you should not trust a t.test. Use voom+limma with raw counts obtained as in 2) instead (limma instead of edgeR since they perform almost identical when you have many replicates (which you have right?) - expect that limma is much much faster.

Hope this helps

ADD COMMENT
0
Entering edit mode

Dear Kristoffer,

Thanks a lot for your answer.

I knew I would be more comfortable with the raw data however I was being forced to work on this normalized data. Maybe it is better to insist on working on the raw data again.

So to clarify, you are suggesting me to use voom+limma on the raw count data instead of edgeR & DESeq2, right ?

ADD REPLY
2
Entering edit mode

Yes i suggest voom-limma because when you have many replicates (which you do in the TCGA data) they perform similar in benchmark. The difference then becomes that instead of waiting hours on edgeR or DESeq2 you can wait seconds/minuts on limma allowing much greater flexibility.

ADD REPLY
0
Entering edit mode

Thanks a lot for your answer.

ADD REPLY

Login before adding your answer.

Traffic: 2640 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6