Question

How to perform Gene Set Enrichment Analysis

0

9.8 years ago

Mo ▴ 920

Dear all,

I have data for three cells, each with many genes (for example, one of my data sets looks as follows):

Gene Name    Drug1      Drug2      Drug3
1007_s_at    -0.2815    -0.2032    -0.2539
1053_at      -0.0113     0.0285    -0.0675
117_at       -0.0448    -0.136     -0.2189
121_at       -0.081      0.1412     0.0464

Based on my search, I found that I should run a t-test, then obtain p-values, etc. There are many functions in R as well as in Java that can be used for GSEA. However, I am stuck at the first step: how to prepare the data set, and then how to analyze it. I don't mind which software I use; could one of you please help me with how to do it?

I am looking forward to hearing from you

gene-set-entrichment-analysis • 6.9k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Mo ▴ 920

0

So, you are working with Arabidopsis, using Affymetrix gene chips, and have no replicates, right?

Without replicates there is no way to get p-values for gene ranking. I think the most important 'pre-processing' step is to get the raw data with biological replicates.

ADD REPLY • link 9.8 years ago by Michael 55k

0

No, this is not Arabidopsis. I don't have replicates within a cell, but I have the same drugs and the same genes measured in three different cells.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Mo ▴ 920

1

So you do have biological replicates; that contradicts your example. Please be more precise with examples. You are not giving enough detail: what is your organism, then? This is important because it determines where to look for GO annotation.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Michael 55k

0

I have the data for three cells from animal liver. I only gave an example matrix to show what they look like. One can imagine that I have three matrices like the example above.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Mo ▴ 920

score 2 · Answer 1 · 2015-01-31

2

9.8 years ago

al3n70rn ▴ 110

Have a look at the GSEA documentation; it is really straightforward:

http://www.broadinstitute.org/gsea/doc/desktop_tutorial.jsp

http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
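As a minimal sketch of what the data-format page describes, assuming the GSEA Preranked tool and its two-column, tab-delimited RNK layout (feature identifier, then ranking metric), one column of the example matrix in the question could be turned into a ranked list roughly as follows. The function name `to_rnk` and the choice of the Drug1 column as the ranking metric are illustrative assumptions, not part of the GSEA software; probe IDs would usually also need to match the identifiers used in the gene sets.

```python
# Drug1 values copied from the example table in the question (probe ID -> value).
drug1 = {
    "1007_s_at": -0.2815,
    "1053_at": -0.0113,
    "117_at": -0.0448,
    "121_at": -0.081,
}

def to_rnk(scores):
    """Return RNK-style text: one 'ID<TAB>score' line per feature, highest score first.

    `to_rnk` is a hypothetical helper written for this sketch.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return "".join(f"{feature}\t{value}\n" for feature, value in ordered)

rnk_text = to_rnk(drug1)
print(rnk_text, end="")
```

The resulting text could be saved with a `.rnk` extension; whether a raw drug column is a sensible ranking metric depends on what the values actually measure.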

ADD COMMENT • link 9.8 years ago by al3n70rn ▴ 110

Ram · Answer 2 · 2015-01-31

1

9.8 years ago

dago ★ 2.8k

The format of your data really depends on the program you are going to use. If you want to perform a GSA using GO, you need to:

annotate your genes with GO terms
define a test group, which I guess is the list of genes you are referring to, and a background group, which you use as the "comparison" term
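The two steps above are commonly evaluated with a hypergeometric (one-sided Fisher) test per GO term. A minimal sketch, assuming that test and using invented gene names and set sizes purely for illustration:

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_annotated, n_test, n_overlap):
    """P(X >= n_overlap) when n_test genes are drawn without replacement from
    n_background genes, of which n_annotated carry the GO term."""
    upper = min(n_annotated, n_test)
    total = comb(n_background, n_test)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_test - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical toy sets: 20 background genes, 5 annotated with one GO term,
# and a test group of 4 genes. None of these names come from the thread.
background = {f"g{i}" for i in range(1, 21)}
go_term = {"g1", "g2", "g3", "g4", "g5"}
test_group = {"g1", "g2", "g3", "g10"}

overlap = len(test_group & go_term)
p_value = hypergeom_enrichment_p(
    len(background), len(go_term), len(test_group), overlap
)
print(overlap, p_value)
```

In practice this would be repeated for every GO term, followed by a multiple-testing correction; the choice of background group strongly affects the result.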

Look at previous posts for more details; the procedure is the same for different organisms:

How can I do GO enrichment analysis for bacteria genome? (biomaRt is not support bacteria anymore)

Below are a few papers on the topic:

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by dago ★ 2.8k

0

Thanks for your answer. I definitely agree that each software package needs its own data structure and, of course, uses a different strategy from the others. However, my question is this: say I have a cell with many genes (as shown above) and I want to perform GSEA. I don't want to define a test group and a background group (classify the genes myself), because I have no clue which genes differ significantly from the others for a given biological activity or question. In other words, I don't know anything about any specific gene, so how do I perform such an analysis? (Some might say: take the mean of each column (drug) and base the analysis on that.) I don't know; I would like to find out what people think and how I can perform such an analysis on this example set.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Mo ▴ 920

1

Well, I do not quite get what you want. If you do not know anything about your genes, you can just look for co-expression patterns. I would say you could look at correlations among the drug groups. Otherwise, you could look for significant differences in expression between groups.
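As a rough sketch of the correlation idea, assuming a plain Pearson coefficient between drug columns and using only the four example rows from the question (far too few for any real conclusion):

```python
from math import sqrt

# Columns of the example table in the question, in row order
# 1007_s_at, 1053_at, 117_at, 121_at.
drugs = {
    "Drug1": [-0.2815, -0.0113, -0.0448, -0.081],
    "Drug2": [-0.2032, 0.0285, -0.136, 0.1412],
    "Drug3": [-0.2539, -0.0675, -0.2189, 0.0464],
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Pairwise correlation between every pair of drug columns.
corr = {(a, b): pearson(drugs[a], drugs[b]) for a in drugs for b in drugs}
for (a, b), r in corr.items():
    print(f"{a}\t{b}\t{r:.3f}")
```

The same function could be applied to pairs of genes (rows) instead of drugs (columns) to look for co-expression.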

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by dago ★ 2.8k

0

I think both correlation and significant differences in expression between groups would make sense and be good to try. Can you please tell me how to do it? Do I simply compute a correlation coefficient?

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Mo ▴ 920

Ram · Answer 3 · 2015-01-31

1

9.8 years ago

Michael 55k

I think the difficulty is that you didn't really think this through, and you seem to be lacking an experimental question. If you do in fact have a specific question for your data, then you should make it explicit. It is not a good idea to pick an analysis approach first and then make the question and data fit somehow; this is sort of the 'opposite' of the scientific method, in my opinion.

Gene Set Enrichment Analysis needs gene sets; that's obvious, but it is hard to define sensible sets without an experimental question. If you don't have a good hypothesis, then you might try a GO term enrichment test instead of GSEA.

GSEA assumes that the genes can be ordered by a value (low to high); differential expression values might be used for this. An interesting contrast could be drug ~ control, testing whether a set of known cancer-associated genes is enriched (this is just an example). Remember that if you cannot come up with any sensible gene set, then GSEA is not suitable.
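To make the ordering idea concrete, here is a minimal sketch of a running-sum enrichment score over a ranked gene list, in the spirit of the GSEA statistic. The gene names, scores, and the function name `enrichment_score` are invented for illustration, and significance (which GSEA assesses by permutation) is not computed here.

```python
def enrichment_score(scores, gene_set, p=1.0):
    """Walk the genes from highest to lowest score; step up at gene-set members
    (weighted by |score|**p) and down at non-members. Return the running-sum
    value with the largest absolute deviation from zero."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hit_weights = {g: abs(scores[g]) ** p for g in ranked if g in gene_set}
    n_miss = len(ranked) - len(hit_weights)
    total_hit_weight = sum(hit_weights.values())
    if not hit_weights or n_miss == 0 or total_hit_weight == 0:
        raise ValueError("need both members and non-members with non-zero weight")

    running = 0.0
    best = 0.0
    for gene in ranked:
        if gene in hit_weights:
            running += hit_weights[gene] / total_hit_weight
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranking metric (e.g. a differential expression statistic).
scores = {"A": 3.0, "B": 2.0, "C": 1.0, "D": -1.0, "E": -2.0, "F": -3.0}

es_top = enrichment_score(scores, {"A", "B"})     # set concentrated at the top
es_bottom = enrichment_score(scores, {"E", "F"})  # set concentrated at the bottom
print(es_top, es_bottom)
```

A set whose members cluster at one end of the ranking gets a score near +1 or -1, while a set scattered through the list stays closer to zero.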

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Michael 55k

0

Thanks for your comment. I certainly do have a question: I want to see which genes are up- or down-regulated.

However, if I have to supply something like 0000 1111, then what is the point of doing such an analysis? If I knew this in advance, why would I need to perform the analysis? I want to see whether I can find this discrimination from the available data set or not.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Mo ▴ 920

1

So you want to do differential expression analysis: look at the limma Bioconductor package. GSEA doesn't apply here.
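For orientation only, a per-gene test can be sketched with a plain one-sample t-statistic. This is not limma's moderated t-statistic, and it assumes that the values are log ratios against a control and that the three cells can be treated as replicates. In each list below, only the first value comes from the question; the other two are invented placeholders for the remaining cells.

```python
from math import sqrt
from statistics import mean, stdev

# Drug1 value per probe in three cells. First value: from the question's table.
# Second and third values: hypothetical placeholders, not real data.
drug1_by_cell = {
    "1007_s_at": [-0.2815, -0.2500, -0.3100],
    "1053_at": [-0.0113, 0.0400, -0.0300],
    "117_at": [-0.0448, -0.1000, 0.0200],
    "121_at": [-0.0810, -0.0600, -0.1200],
}

def one_sample_t(values, mu=0.0):
    """t-statistic for the null hypothesis that the mean of `values` equals mu."""
    n = len(values)
    return (mean(values) - mu) / (stdev(values) / sqrt(n))

t_stats = {probe: one_sample_t(v) for probe, v in drug1_by_cell.items()}

# Probes ordered by the magnitude of the statistic, largest first.
for probe, t in sorted(t_stats.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{probe}\t{t:.2f}")
```

Turning these statistics into p-values needs the t distribution with n - 1 degrees of freedom, and with only three replicates per gene a moderated approach such as limma's is generally preferable.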

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Michael 55k