For GSEA, check the example file formats to get an idea of the formatting. I recently used the Java implementation of GSEA for the first time and got it working.
cls file
Contains information on the factors in our data. The first line, 35 7 1, means, in this case, 35 samples and 7 unique levels for the listed factor; the trailing 1 is a fixed value. The second line starts with # and names the levels. On the third line of the file, we list the actual levels as they relate to our samples - these should line up with the columns in the gct file.
NB - these are space-delimited.
35 7 1
# d0 d1 d2 d4 d6 d8 d10
d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10
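If you would rather generate the cls file than type it by hand, here is a minimal Python sketch. The function name make_cls is my own (hypothetical), and it assumes the labels are already in the same order as the sample columns in the gct file.

```python
def make_cls(labels):
    """Return the text of a categorical .cls file for per-sample labels.

    Hypothetical helper; `labels` must follow the column order of the gct file.
    """
    if not labels:
        raise ValueError("need at least one sample label")
    if any(" " in label or "\t" in label for label in labels):
        raise ValueError("labels must not contain whitespace (file is space-delimited)")
    # Unique levels, kept in order of first appearance.
    levels = list(dict.fromkeys(labels))
    header = f"{len(labels)} {len(levels)} 1"
    return "\n".join([header, "# " + " ".join(levels), " ".join(labels)]) + "\n"


days = ["d0", "d1", "d2", "d4", "d6", "d8", "d10"]
cls_text = make_cls(days * 5)  # 5 replicates of the 7 time points
print(cls_text, end="")
```

Write cls_text to a file ending in .cls and load it together with the gct file.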
gct file
This contains the expression values. You need a NAME and a DESCRIPTION column before the count values actually start. DESCRIPTION can be just na. Again, note the header information: here, 18062 genes x 35 samples.
NB - these are tab-delimited.
#1.0
18062 35
NAME DESCRIPTION Day 0, rep 1 Day 1, rep 1 Day 2, rep 1 Day 4, rep 1 Day 6, rep 1 Day 8, rep 1 Day 10, rep 1 Day 0, rep 2 Day 1, rep 2 Day 2, rep 2 Day 4, rep 2 Day 6, rep 2 Day 8, rep 2 Day 10, rep 2 Day 0, rep 3 Day 1, rep 3 Day 2, rep 3 Day 4, rep 3 Day 6, rep 3 Day 8, rep 3 Day 10, rep 3 Day 0, rep 4 Day 1, rep 4 Day 2, rep 4 Day 4, rep 4 Day 6, rep 4 Day 8, rep 4 Day 10, rep 4 Day 0, rep 5 Day 1, rep 5 Day 2, rep 5 Day 4, rep 5 Day 6, rep 5 Day 8, rep 5 Day 10, rep 5
A1BG na -1.78750107249577 -1.78731965121805 -1.78739011815182 -1.78648292007421 -1.78825323052185 -1.75670265819045 -1.7856669206048 -1.78652518885366 -1.78682730267777 -1.78980334199807 -1.78644486265833 -1.7868860041479 -1.78844156465141 -1.78740712853483 -1.75644423399062 -1.78612773069836 -1.78929036918159 -1.78723396224438 -1.76697481762272 -1.78693195908128 -1.78629510548009 -1.78470994669637 -1.78615883408804 -1.75804087324122 -1.78652254894815 -1.78711039289089 -1.76833202023458 -1.78672978697874 -1.7850823437463 -1.78625577998891 -1.78670342516185 -1.78584154361388 -1.78728728194433 -1.78497558588491 -1.78644925915904
A1CF na 1.68492754186313 1.54066315490874 1.54006231864025 1.51816007039476 1.60513517299563 1.5837019048566 1.61600434016912 1.51769932951262 1.60421752506403 1.56906960878706 1.65730147755638 1.57148034912919 1.64703379520972 1.54022084471361 1.61967950619213 1.51949572547524 1.52562157884476 1.540660774612 1.54957287190596 1.48702357593441 1.54796402052754 1.59524718481615 1.48932230313822 1.60079524224128 1.75736087058801 1.51447655944983 1.61715833564219 1.60452069557156 1.52619397748714 1.48902853362178 1.57432099780454 1.64145506694909 1.56773033915297 1.52760402017735 1.65905159731629
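The same layout can be produced programmatically. This is only a sketch: make_gct is a hypothetical helper, the values are rounded toy numbers based on the rows above, and the version string is a parameter because the example above uses #1.0 while the format documentation that I am aware of describes #1.2.

```python
def make_gct(sample_names, rows, version="#1.0"):
    """Return tab-delimited .gct text.

    rows: list of (gene_name, values) with one value per sample.
    Hypothetical helper; DESCRIPTION is filled with 'na' for every gene.
    """
    n_samples = len(sample_names)
    lines = [
        version,
        f"{len(rows)}\t{n_samples}",  # number of genes, number of samples
        "\t".join(["NAME", "DESCRIPTION", *sample_names]),
    ]
    for gene, values in rows:
        if len(values) != n_samples:
            raise ValueError(f"{gene}: expected {n_samples} values, got {len(values)}")
        lines.append("\t".join([gene, "na", *(repr(float(v)) for v in values)]))
    return "\n".join(lines) + "\n"


samples = ["Day 0, rep 1", "Day 1, rep 1", "Day 2, rep 1"]
toy_rows = [
    ("A1BG", [-1.7875, -1.7873, -1.7874]),
    ("A1CF", [1.6849, 1.5407, 1.5401]),
]
gct_text = make_gct(samples, toy_rows)
print(gct_text, end="")
```

The length check is the useful part: a row with too few or too many values is a common reason for a gct file failing to load.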
gmt file
Contains the signatures:
GO_CELL_REDOX_HOMEOSTASIS http://software.broadinstitute.org/gsea/msigdb/cards/GO_CELL_REDOX_HOMEOSTASIS.html PDIA6 TXNDC9 GLRX3 PRDX4 TXNRD2 PDIA5 EGLN2 TXNRD3 AIFM3 CYBA CYBB DDIT3 QSOX2 DLD PDILT ERP44 DNAJC16 NNT TXNDC8 TXN2 GCLC GLRX GPX1 PDIA3 GSR ERO1L APEX1 NME9 IL6 GRXCR1 LTF NCF2 NCF4 NFE2L2 NOS1 NOS2 NOS3 P4HB GLRX2 TXNDC12 TXNDC11 TMX2 GLRX5 TXNDC3 DNAJC10 TMX3 SELS TMX4 ERO1LB TXNDC16 QSOX1 PDIA2 NCF1 SLC11A1 TXN TXNRD1 TXNDC15 PTGES2 TMX1 TXNDC5 CAMP SH3BGRL3 TXNDC2 KRIT1 AIFM1 TXNL1 PDIA4
GO_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_IN_RESPONSE_TO_ENDOPLASMIC_RETICULUM_STRESS http://software.broadinstitute.org/gsea/msigdb/cards/GO_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_IN_RESPONSE_TO_ENDOPLASMIC_RETICULUM_STRESS.html CASP12 CEBPB DAB2IP DDIT3 ERN1 PPP1R15A BBC3 GSK3B ERO1L UBE2K APAF1 ITPR1 MAP3K5 ATF4 ATP2A1 PMAIP1 PML DNAJC10 TRIB3 BAK1 BAX SELK BCL2 TMBIM6 TRAF2 XBP1 CHAC1 BAG6 CASP4 TNFRSF10B BRSK2 AIFM1
NB - these are tab-delimited.
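To check whether a gmt file follows this layout (set name, description, then the genes, all tab-separated), a small parser is enough. This is a rough sketch with a hypothetical function name, not a full validator.

```python
def parse_gmt(text):
    """Parse .gmt text into {set_name: [genes]}.

    Hypothetical helper; raises ValueError on lines that do not have
    a name, a description and at least one gene separated by tabs.
    """
    gene_sets = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(
                f"line {line_no}: expected name, description and genes, tab-separated"
            )
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


example = (
    "GO_CELL_REDOX_HOMEOSTASIS\t"
    "http://software.broadinstitute.org/gsea/msigdb/cards/GO_CELL_REDOX_HOMEOSTASIS.html\t"
    "PDIA6\tTXNDC9\tGLRX3\n"
)
parsed = parse_gmt(example)
print(parsed)
```

A file that uses spaces or commas instead of tabs will fail on its first line, which is a quick way to spot a badly formatted download.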
----------------------------------------
Kevin
Thank you @Kevin. I am new to this kind of analysis. In total I have 8 samples (4 treated and 4 untreated) with 3 replicates.
I am now confused about which expression values I have to give in the gct file. Please help me in this regard.
Hey, you should go one step further and produce the rlog or vst counts, and then use those in the gct file.
Thank you. I have used this code:
I obtained this file:
Should I use these values? Also, I don't know how to create the gmt file. Thank you
Thank you, but remember that you require this format:
You need an extra column for DESCRIPTION.
Thank you Kevin, it is working now. I have downloaded the gene set files for Arabidopsis thaliana from this website. Is it right to use them? GSEA shows an error when the GMT-formatted file for all gene sets is uploaded, but it works well when some of the individual gene sets are uploaded.
The GMT files at the link that you posted do not look correct, to be honest. Take a look at the format and compare it to the one that I posted above.
Are you using the Java version of GSEA from the command line?
I am not using the command line but the GSEA desktop application. The format of these files does look different from the one you posted. I do not know where I can get gene sets for Arabidopsis; I could not find Arabidopsis on gsea/msigdb. Could you please suggest a link? One other thing I would like to clarify: I am using ATH1_121501.chip as the chip platform in the GSEA analysis. Is this the right chip to use for Arabidopsis thaliana RNA-seq data?
I think that it may involve some searching. For example, I found this, and the gene sets are in the correct format: http://www.go2msig.org/cgi-bin/prebuilt.cgi?taxid=3702
It is also easy to create custom gene sets.
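As a rough illustration, a custom gene set file is just one tab-delimited line per set. The sketch below uses a hypothetical helper (make_gmt) and a made-up set name, with gene symbols borrowed from the example earlier in the thread; it assumes that na is acceptable as the description field.

```python
def make_gmt(gene_sets, description="na"):
    """Return tab-delimited .gmt text from {set_name: [genes]}.

    Hypothetical helper; duplicate genes within a set are dropped,
    keeping the first occurrence.
    """
    lines = []
    for name, genes in gene_sets.items():
        unique = list(dict.fromkeys(genes))
        if not unique:
            raise ValueError(f"{name}: gene set is empty")
        lines.append("\t".join([name, description, *unique]))
    return "\n".join(lines) + "\n"


# Made-up set name; ATF4 is repeated on purpose to show de-duplication.
custom_sets = {"MY_ER_STRESS_SET": ["DDIT3", "ATF4", "XBP1", "ATF4"]}
gmt_text = make_gmt(custom_sets)
print(gmt_text, end="")
```

The gene identifiers in the gmt file have to match the ones in the NAME column of the gct file (or be mappable through the chip file), otherwise no genes will overlap.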