cls file

Question

Running GSEA for DEGs

0

Entering edit mode

5.8 years ago

umeshtanwar2 ▴ 30

Hi all,

I am working on RNA-seq data of Arabidopsis plant. I have a list of Deferentially Expressed Genes obtained from DESeq2. Now I would like to do further analysis using GSEA. I prepared the rank file by using the code below:

x <- read.csv("DEG-B_vs_A.csv")
attach(x)
x$fcSign=sign(log2FoldChange)
x$logP=-log10(pvalue)
x$metric= x$logP/x$fcSign
y<-x[,c("Gene", "metric")]
head(y)`
write.table(y,file="DE_genes.rnk",quote=F,sep="\t",row.names=F)

This rnk file looks like this:

   Gene   metric
  
1 AT1G01060 2.743735 2 AT1G01100 3.372178 3 AT1G01180 -3.800650 4 AT1G01240 -7.978849 5 AT1G01280 3.131964

1.) Now when I try to load this file on GSEA 3.0, it shows some error:

There were errors: ERROR(S) #:1 Parsing trouble

Could you please help me in this regard?

2.) I also tried to use this file on WebGestalt link. It shows the error:

ERROR: All IDs in the uploaded list are not annotated to any category of the functional database.

R error (1555065730).

I am using the default parameters and I am selecting genesymbol for Select Gene ID Type in input file. Can you please tell me if the code I used for .rnk file is right or I did some mistake. Any help from you will be very helpful for me.

Thank you

RNA-Seq next-gen • 5.8k views

ADD COMMENT • link updated 5.8 years ago by Prakash ★ 2.2k • written 5.8 years ago by umeshtanwar2 ▴ 30

score 4 · Answer 1 · 2019-04-12

4

Entering edit mode

5.8 years ago

Kevin Blighe 88k

For GSEA, check the example file formats to get an idea of the formatting. I recently used the JAVA implementation of GSEA for the first time and got it working.

cls file

Contains information on factors in our data. 35 7 means, in this case, 35 samples and 7 unique levels for the listed factor. On the third line of the file, we list the actual levels as they relate to our samples - these should line up to the columns in the gct file.

NB - these are space-delimited.

35 7 1
# d0 d1 d2 d4 d6 d8 d10
d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10

gct file

This contains the expression values. You need a NAME and DESCRIPTION column before the counts values actually start. Description can be just na. Again, note the header information, here, 18062 genes X 35 samples.

NB - these are tab-delimited.

#1.0
18062   35
NAME    DESCRIPTION Day 0, rep 1    Day 1, rep 1    Day 2, rep 1    Day 4, rep 1    Day 6, rep 1    Day 8, rep 1    Day 10, rep 1   Day 0, rep 2    Day 1, rep 2    Day 2, rep 2    Day 4, rep 2    Day 6, rep 2    Day 8, rep 2    Day 10, rep 2   Day 0, rep 3    Day 1, rep 3    Day 2, rep 3    Day 4, rep 3    Day 6, rep 3    Day 8, rep 3    Day 10, rep 3   Day 0, rep 4    Day 1, rep 4    Day 2, rep 4    Day 4, rep 4    Day 6, rep 4    Day 8, rep 4    Day 10, rep 4   Day 0, rep 5    Day 1, rep 5    Day 2, rep 5    Day 4, rep 5    Day 6, rep 5    Day 8, rep 5    Day 10, rep 5
A1BG    na  -1.78750107249577   -1.78731965121805   -1.78739011815182   -1.78648292007421   -1.78825323052185   -1.75670265819045   -1.7856669206048    -1.78652518885366   -1.78682730267777   -1.78980334199807   -1.78644486265833   -1.7868860041479    -1.78844156465141   -1.78740712853483   -1.75644423399062   -1.78612773069836   -1.78929036918159   -1.78723396224438   -1.76697481762272   -1.78693195908128   -1.78629510548009   -1.78470994669637   -1.78615883408804   -1.75804087324122   -1.78652254894815   -1.78711039289089   -1.76833202023458   -1.78672978697874   -1.7850823437463    -1.78625577998891   -1.78670342516185   -1.78584154361388   -1.78728728194433   -1.78497558588491   -1.78644925915904
A1CF    na  1.68492754186313    1.54066315490874    1.54006231864025    1.51816007039476    1.60513517299563    1.5837019048566 1.61600434016912    1.51769932951262    1.60421752506403    1.56906960878706    1.65730147755638    1.57148034912919    1.64703379520972    1.54022084471361    1.61967950619213    1.51949572547524    1.52562157884476    1.540660774612  1.54957287190596    1.48702357593441    1.54796402052754    1.59524718481615    1.48932230313822    1.60079524224128    1.75736087058801    1.51447655944983    1.61715833564219    1.60452069557156    1.52619397748714    1.48902853362178    1.57432099780454    1.64145506694909    1.56773033915297    1.52760402017735    1.65905159731629

gmt file

Contains the signatures:

GO_CELL_REDOX_HOMEOSTASIS   http://software.broadinstitute.org/gsea/msigdb/cards/GO_CELL_REDOX_HOMEOSTASIS.html PDIA6   TXNDC9  GLRX3   PRDX4   TXNRD2  PDIA5   EGLN2   TXNRD3  AIFM3   CYBA    CYBB    DDIT3   QSOX2   DLD PDILT   ERP44   DNAJC16 NNT TXNDC8  TXN2    GCLC    GLRX    GPX1    PDIA3   GSR ERO1L   APEX1   NME9    IL6 GRXCR1  LTF NCF2    NCF4    NFE2L2  NOS1    NOS2    NOS3    P4HB    GLRX2   TXNDC12 TXNDC11 TMX2    GLRX5   TXNDC3  DNAJC10 TMX3    SELS    TMX4    ERO1LB  TXNDC16 QSOX1   PDIA2   NCF1    SLC11A1 TXN TXNRD1  TXNDC15 PTGES2  TMX1    TXNDC5  CAMP    SH3BGRL3    TXNDC2  KRIT1   AIFM1   TXNL1   PDIA4
GO_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_IN_RESPONSE_TO_ENDOPLASMIC_RETICULUM_STRESS    http://software.broadinstitute.org/gsea/msigdb/cards/GO_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_IN_RESPONSE_TO_ENDOPLASMIC_RETICULUM_STRESS.html  CASP12  CEBPB   DAB2IP  DDIT3   ERN1    PPP1R15A    BBC3    GSK3B   ERO1L   UBE2K   APAF1   ITPR1   MAP3K5  ATF4    ATP2A1  PMAIP1  PML DNAJC10 TRIB3   BAK1    BAX SELK    BCL2    TMBIM6  TRAF2   XBP1    CHAC1   BAG6    CASP4   TNFRSF10B   BRSK2   AIFM1

NB - these are tab-delimited.

----------------------------------------

Kevin

ADD COMMENT • link 5.8 years ago by Kevin Blighe 88k

0

Entering edit mode

Thank you @Kevin. I am new for this kind of analysis. In total I have 8 samples (4 treated and 4 untreated) with 3 replicates.

countdata <- read.table("NewTotalCounts.txt", header=TRUE, row.names=1)
countdata <- countdata[ ,6:ncol(countdata)]
countdata <- as.matrix(countdata)
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~genotype*treatment)
dds
dds <- DESeq(dds)
matrix(resultsNames(dds))
"1" "Intercept"
"2" "genotype_C_vs_A"   
"3" "genotype_D_vs_A"
"4" "genotype_B_vs_A"
"5" "treatment_Post.HL_vs_Mock"
"6" "genotypeC.treatmentPost.HL"
"7" "genotypeD.treatmentPost.HL"
"8" "genotypeB.treatmentPost.HL"

I am confused now which expression values I have to give in the gct file? Please help me in this regard.

ADD REPLY • link 5.8 years ago by umeshtanwar2 ▴ 30

1

Entering edit mode

Hey, you should go one step more to produce the rlog or vst counts, and then use those in the gct file.

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

0

Entering edit mode

Thank you. I have used this code:

rld <- rlogTransformation(dds)
head(assay(rld))
hist(assay(rld))
write.table(assay(rld), file="gsea.txt")

I obtained this file:

> "A_m1" "A_m2" "A_m3" "A_t1" "A_t2" "A_t3" "B_m1" "B_m2" "B_m3" "B_t1" "B_t2" "B_t3" "C_m1" "C_m2" "C_m3" "C_t1" "C_t2" "C_t3" "D_m1" "D_m2" "D_m3" "D_t1" "D_t2" "D_t3"
"AT1G01010" 7.62881756293462 7.61583785579526 7.50395202025519 8.0461501392717 8.07839347717189 7.8935638435766 7.3796683319956 7.43183898570422 7.34773443084268 7.70575244915009 7.95364985067999 7.87739554627091 7.45420158450824 7.38007181030954 7.42630386340081 8.92908319872127 8.96033513023244 9.11128700629371 7.60061777221793 7.58291015289311 7.6362555930631 8.86572624373881 8.92296240402509 9.18643861208017
"AT1G01020" 8.14128901967722 8.18741899908557 8.14540372194814 8.15583644936684 8.15820207932771 8.25324372705063 8.2135967684159 8.27336849710576 8.28136628735159 8.23387984635743 8.14470554657358 8.2677865861621 8.19538568421793 8.21638435731831 8.10344524130151 8.31758011042932 8.35093130092293 8.47359297007998 8.27093500637238 8.26539640802531 8.15150104583513 8.35192935336261 8.37768591750274 8.38023709497797
"AT1G03987" -1.21011607165463 -1.28992405272131 -1.28358871546608 -1.28437100997607 -1.28687500635835 -1.290740685219 -1.2834693985665 -1.27726214927632 -1.28609987502108 -1.29397158956222 -1.04468043552706 -1.13979205707524 -1.28435762418637 -1.28924192870063 -1.28828902166023 -1.28550743082781 -1.12011728153111 -1.28054964672509 -1.27720249885525 -1.20581307309493 -1.28256078478735 -1.20218440774749 -1.28946362827518 -1.274638

Should I use these values? Also I don't know how to create gmt file. Thank you

ADD REPLY • link 5.8 years ago by umeshtanwar2 ▴ 30

1

Entering edit mode

Thank you, but remember that you require this format:

#1.0
18062   35
NAME    DESCRIPTION Day 0, rep 1    Day 1, rep 1    Day 2, rep 1    Day 4, rep 1    Day 6, rep 1    Day 8, rep 1    Day 10, rep 1   Day 0, rep 2    Day 1, rep 2    Day 2, rep 2    Day 4, rep 2    Day 6, rep 2    Day 8, rep 2    Day 10, rep 2   Day 0, rep 3    Day 1, rep 3    Day 2, rep 3    Day 4, rep 3    Day 6, rep 3    Day 8, rep 3    Day 10, rep 3   Day 0, rep 4    Day 1, rep 4    Day 2, rep 4    Day 4, rep 4    Day 6, rep 4    Day 8, rep 4    Day 10, rep 4   Day 0, rep 5    Day 1, rep 5    Day 2, rep 5    Day 4, rep 5    Day 6, rep 5    Day 8, rep 5    Day 10, rep 5
A1BG    na  -1.78750107249577   -1.78731965121805   -1.78739011815182   -1.78648292007421   -1.78825323052185   -1.75670265819045   -1.7856669206048    -1.78652518885366   -1.78682730267777   -1.78980334199807   -1.78644486265833   -1.7868860041479    -1.78844156465141   -1.78740712853483   -1.75644423399062   -1.78612773069836   -1.78929036918159   -1.78723396224438   -1.76697481762272   -1.78693195908128   -1.78629510548009   -1.78470994669637   -1.78615883408804   -1.75804087324122   -1.78652254894815   -1.78711039289089   -1.76833202023458   -1.78672978697874   -1.7850823437463    -1.78625577998891   -1.78670342516185   -1.78584154361388   -1.78728728194433   -1.78497558588491   -1.78644925915904
A1CF    na  1.68492754186313    1.54066315490874    1.54006231864025    1.51816007039476    1.60513517299563    1.5837019048566 1.61600434016912    1.51769932951262    1.60421752506403    1.56906960878706    1.65730147755638    1.57148034912919    1.64703379520972    1.54022084471361    1.61967950619213    1.51949572547524    1.52562157884476    1.540660774612  1.54957287190596    1.48702357593441    1.54796402052754    1.59524718481615    1.48932230313822    1.60079524224128    1.75736087058801    1.51447655944983    1.61715833564219    1.60452069557156    1.52619397748714    1.48902853362178    1.57432099780454    1.64145506694909    1.56773033915297    1.52760402017735    1.65905159731629

You need an extra column for DESCRIPTION

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

0

Entering edit mode

Thank you Kevin, It is working now. I have downloaded the gene data sets files for Arabidopsis thaliana from the website enter link description here. Is this right to use that? It shows error when the GMT formatted file for all gene sets is uploaded. But works well when some of individual data sets are uploaded.

ADD REPLY • link 5.8 years ago by umeshtanwar2 ▴ 30

1

Entering edit mode

The GMT files through that link that you posted do not look correct, to be honest. Take a look at the format and compare to the one that I posted, above.

Are you using GSEA JAVA version from the command line?

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

0

Entering edit mode

I am not using the command line but GSEA Desktop Application. The format of these files looks different from the one you posted. I do not know from which source I can get the gene data sets for Arabidopsis. I could not find Arabidopsis on gsea/msigdb. Could you please suggest some link? Other thing I would like to be clear, I am using ATH1_121501.chip for Chip platform in GSEA analysis. Is this the right chip to use for Arabidopsis thaliana plant RNA-seq data?

ADD REPLY • link 5.8 years ago by umeshtanwar2 ▴ 30

0

Entering edit mode

I think that it may involve some searching. For example, I found this and they gene sets are in the correct format: http://www.go2msig.org/cgi-bin/prebuilt.cgi?taxid=3702

It is also easy to create custom gene sets.

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

score 3 · Answer 2 · 2019-04-12

3

Entering edit mode

5.8 years ago

Prakash ★ 2.2k

I don't think the gene identifier you are using is gene symbol for your organism of interest. could you please convert it to official gene symbol and try it again.

ADD COMMENT • link 5.8 years ago by Prakash ★ 2.2k

0

Entering edit mode

looks like Arabidopsis

ADD REPLY • link 5.8 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

Yes, its Arabidopsis.

ADD REPLY • link 5.8 years ago by umeshtanwar2 ▴ 30

0

Entering edit mode

Thank you for the response. These are the official gene symbols for Arabidopsis.

ADD REPLY • link 5.8 years ago by umeshtanwar2 ▴ 30