Question

RNA Seq data with GSEA

0

Entering edit mode

5.6 years ago

pennakiza ▴ 60

Hi all,

I am a bit confused with how I should be transforming my data to use it with GSEA. I have tried two different approaches and I am not sure which one is the best.

First, I have created a DGElist, filtered my low count reads and transformed it with voom. These data I fed then to GSEA.

y <- DGEList(counts = fc_mydata$counts, genes=fc_mydata$annotation[,c("GeneID","Length")])
keep<-filterByExpr(y)
y<-y[keep,]
y<-calcNormFactors(y, method="TMM")
y_voom<-voom(y) 
GSEA_table<-y_voom$E
colnames(GSEA_table)<-pastemydata$Treatment[match(colnames(GSEA_table),mydata$Sample)],mydata$DaysToRebound,mydata$SampleID,sep="_")
 GSEA_table<-GSEA_table[,order(colnames(GSEA_table))]
 symb<-annotation$Symbol[match(rownames(GSEA_table),annotation$GeneID)]
 write.table(file="GSEA_table_final.txt",cbind("NAME"=symb,GSEA_table),row.names = F,quote = F,sep = "\t")

Second approach, I ran DESeq2 on raw counts, transformed it with vsn and fed that to GSEA.

dds<-DESeqDataSetFromMatrix(countData=fc_mydata$counts, colData=mydata, ~Treatment)
dds<-DESeq(dds)
res<-results(dds)
res_ordered<-res[order(res$padj),]
res_ordered<- res_ordered[order(res_ordered$padj, decreasing = F),]
rownames(res_ordered)<-make.names(annotation$Symbol[match(rownames(res_ordered), annotation$GeneID)], unique=TRUE)
head(res_ordered)
write.table(res_ordered, "res_ordered",sep="\t")

vst<-vst(dds, blind=F)
normalised_vst<-assay(vst)
rownames(normalised_vst)<-make.names(annotation$Symbol[match(rownames(normalised_vst), annotation$GeneID)], unique=TRUE)
normalised_vst<-normalised_vst[,order(colnames(normalised_vst))]
write.table(normalised_vst, "normalised_vst.txt", cbind("NAME" = rownames(normalised_vst), normalised_vst),row.names = F,quote = F,sep = "\t")

What do you think, which is the most correct one? What can I do to improve?

Thanks a lot!

GSEA RNA-Seq voom vsn • 4.9k views

ADD COMMENT • link updated 5.6 years ago by Gordon Smyth ★ 8.3k • written 5.6 years ago by pennakiza ▴ 60

score 2 · Answer 1 · 2020-01-26

2

Entering edit mode

5.6 years ago

Gordon Smyth ★ 8.3k

I am the voom author, but I do not recommend voom for this purpose because GSEA is unable to use the voom precision weights. Just use:

y <- DGEList(counts ...)
y <- calcNormFactors(y)
GSEA_table <- cpm(y, log=TRUE)

You do not need filterByExpr here but, if you do use it, it should be used in conjunction with your treatment factor or design matrix.

As an alternative to GSEA, you might also consider the pathway analysis functions such as camera that come as part of limma.

ADD COMMENT • link 5.6 years ago by Gordon Smyth ★ 8.3k

0

Entering edit mode

Thank you Gordon! I will try camera too!

ADD REPLY • link 5.6 years ago by pennakiza ▴ 60

score 0 · Answer 2 · 2020-01-25

0

Entering edit mode

5.6 years ago

dsull ★ 7.6k

Look at the instructions here:

https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Using_RNA-seq_Datasets_with_GSEA

Using DESeq2 is fine (and other types of acceptable RNA-seq normalizations should also be fine).

You also may not need to vst-transform your normalized counts, per the following FAQ: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ#Should_I_use_natural_or_log_scale_data_for_GSEA.3F

DESeq2 also gives you the log fold changes (treatment compared to control) of each gene. You can just use the log fold changes to rank your gene list and then feed that ranked gene list into GSEAPreranked algorithm. This is what I generally do. If you do this, I'd recommend using lfcShrink to get more accurate log fold change values.

ADD COMMENT • link 5.6 years ago by dsull ★ 7.6k

0

Entering edit mode

Thank you very much!

I thought that you needed to transform the data before but I will certainly try it without it. I have tried to used preranked but in some cases I have only a few DE genes (p-adj <0.05) and if I only use those I don't get any gene sets enriched.

ADD REPLY • link 5.6 years ago by pennakiza ▴ 60

0

Entering edit mode

Ah, for the prerank algorithm, you still use ALL your genes, not just your DE ones. So if 200 out of 40000 genes are DE, you still use all 40000 genes. Basically, you just supply a list of ALL genes ranked by their log fold change.

That's the power of GSEA -- you use every single gene and it's not based off of cutoffs/thresholds. Contrast this to gene ontology overrepresentation analysis where you only input a list of genes that are considered DE.

ADD REPLY • link 5.6 years ago by dsull ★ 7.6k

0

Entering edit mode

That's cool, in this case I am getting the same results, which is reassuring!

Thank you very much for your help!

ADD REPLY • link 5.6 years ago by pennakiza ▴ 60