I'm studying two cancers in the same tissue and wanted to identify common pathways in both diseases based on microarray and RNA-Seq expression data.
I could run separate enrichment and pathway analysis on each disease and simply observe which pathways are common.
My question is: is there a way to use the intersection of differentially expressed genes in each disease in any kind of enrichment analysis?
I tried running over-representation tests and GSEA on the intersection list but since it is quite small -- around 100 DEGs -- all results came out statistically insignificant. Perhaps I'm using the wrong background list? I've tried the background list obtained after employing typical non-specific filtering techniques. It works well for each individual disease but not with the intersection of GDEs.
Someone suggested that I use the union of GDEs as background but I couldn't find a strong rationale for that (or any good reference on this, for that matter.)
Thanks for the quick reply. For now I'm not mixing these two types of data. That will certainly be another issue later on.
Sample size shouldn't be much of an issue as I have more than 100 samples in each group (disease 1, adjacent tissue 1, disease 2, adjacent tissue 2). I mean, sample size is ALWAYS an issue but 400 data points is not too bad for microarray data, I guess...
The point now is really to know (or to be more confident about) if there is a way to use the intersection DEG list directly in any kind of enrichment analysis and, if so, which background to use.
In that case you have to take into account the background of both the microarray (probes mapped to genes) and the RNA-Seq (genes) data: build a matrix of the genes that are expressed on both platforms, then split it by platform and run the differential expression analysis (DEA) on each. That way the genes on which you ran the DEA come from the same background for both platforms. Obviously you might lose some genes, but since you have a considerable number of data points that should not be a concern. In short, if you create an expression matrix from the microarray data mapped to gene symbols and do the same for the RNA-Seq data (mapping read counts to gene symbols), you will have m genes for the microarray and n genes for RNA-Seq. You can overlap them to see how large the overlap is. If the overlap is high and significant, create an expression matrix of the genes shared between the m and n sets for both microarray and RNA-Seq, run DEA on each separately, and proceed as described above. In this case your background is the same for both platforms. You might lose some important genes as well, but your aim is to see the DEGs captured consistently by both platforms, so some kind of enrichment will be there. Whatever DEGs you get from each DEA, you can run GO enrichment or GSEA separately for each platform, with the same background for both.
Thanks again. I think I wasn't very clear in my first comment, sorry for that. At this moment, I'm not integrating microarray data with RNA-Seq data; I'm only analyzing microarray data for now.
To recap: I know how to run over-representation analysis and GSEA on each single dataset. That is fine. What I want to do is find commonly altered pathways in both cancers; finding, for instance, that both cancers have an upregulated MAPK pathway.
I believe that we could do this in two ways: (1) doing functional annotation for each cancer individually and then comparing the altered pathways, and (2) comparing GDEs between the two conditions and THEN running functional annotation on the intersection of GDEs.
I would like to hear from you all whether the second approach makes any sense and, if so, which background list I should use.
Ciao,
Leonardo
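For illustration only, here is a minimal Python sketch of approach (2): a one-sided hypergeometric over-representation test on the intersection list, run against two candidate backgrounds. The gene IDs, set sizes, and the function names `hypergeom_sf` and `over_representation` are made up for this example and are not from any particular package. It does not settle which background is correct; it only shows that the choice changes the result.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k): n genes drawn from N background genes, K of them in the pathway."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(max(k, 0), min(K, n) + 1))
    return hits / total


def over_representation(degs, pathway, background):
    """One-sided hypergeometric test; all sets are restricted to the background first."""
    background = set(background)
    degs = set(degs) & background
    pathway = set(pathway) & background
    k = len(degs & pathway)
    return k, hypergeom_sf(k, len(background), len(pathway), len(degs))


# Toy data (hypothetical gene IDs, not real results).
all_tested = {f"g{i}" for i in range(1000)}   # genes left after non-specific filtering
pathway = {f"g{i}" for i in range(50)}        # one gene set
degs_1 = {f"g{i}" for i in range(0, 20)} | {f"g{i}" for i in range(100, 180)}
degs_2 = {f"g{i}" for i in range(10, 30)} | {f"g{i}" for i in range(150, 230)}

shared = degs_1 & degs_2                      # intersection of the two DEG lists
k_all, p_all = over_representation(shared, pathway, all_tested)
k_union, p_union = over_representation(shared, pathway, degs_1 | degs_2)
print(f"background = all tested genes: overlap={k_all}, p={p_all:.3g}")
print(f"background = union of DEGs:    overlap={k_union}, p={p_union:.3g}")
```

The overlap with the pathway is the same in both calls, but the p-values differ because the two backgrounds pose different null hypotheses. With all tested genes as background, the test asks whether the pathway is enriched among the shared DEGs relative to everything measured. With the union as background, it asks whether, among genes altered in at least one cancer, the pathway is enriched in those altered in both.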
Ah OK, I thought you were trying to do a kind of meta-analysis. Fair enough, it was not clear. Let me make sure I understand: you are analyzing microarray data from two different tumors, each as control vs. diseased (is that so?). Then, after normalization:
1) For the first tumor (control vs. disease) you will have an expression matrix that you can derive from limma (gene names as rows and expression values as columns for your samples, both control and diseased), say m rows and n columns.
2) You will get the same thing for your second type of tumor: x rows (genes) and y columns (samples).
3) Now you can overlap the m and x rows of genes to see which genes are actually expressed in both diseases, fetch those gene IDs, and compute the expression matrix across all samples separately for each disease with its corresponding controls.
4) The rows of genes you get in 3) are the gene IDs common to both tumor types, which can be used as the background for GSEA.
5) I would in fact prefer to also run the differential expression analysis on the same background, so that I can also control for false positives among the DEGs: you use only the genes that are expressed in both cancers and apply DE analysis to them to get the GDEs (as you say). In doing so you will have considerably fewer expressed genes as background in both tumors, but your GDEs will be specific to the background of genes common to both tumors. Once you have performed the DE analysis and extracted the GDE list for each tumor separately, you can use them for GSEA with the background that was common to both.
Let me know whether this is clear to you.
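The steps above can be sketched in a few lines of Python. This is a rough illustration under simplifying assumptions: each expression matrix is a plain dict mapping gene ID to per-sample values, the numbers are invented, and `common_background` and `restrict` are hypothetical helper names. The DE analysis itself (e.g. with limma) is not shown.

```python
def common_background(expr_a, expr_b):
    """Gene IDs present in both expression matrices (step 3)."""
    return sorted(set(expr_a) & set(expr_b))


def restrict(expr, genes):
    """Keep only the rows (genes) that belong to the shared background."""
    return {gene: expr[gene] for gene in genes}


# Toy matrices: gene ID -> expression values per sample (made-up numbers).
tumor_1 = {"TP53": [7.1, 6.8, 9.2], "KRAS": [5.0, 5.3, 8.1], "EGFR": [4.2, 4.0, 4.1]}
tumor_2 = {"TP53": [6.9, 9.0], "KRAS": [5.1, 7.7], "BRAF": [3.3, 6.0]}

background = common_background(tumor_1, tumor_2)   # step 4: shared background
tumor_1_common = restrict(tumor_1, background)     # input for DE analysis of tumor 1
tumor_2_common = restrict(tumor_2, background)     # input for DE analysis of tumor 2
print(background)
```

Each restricted matrix would then go through DE analysis separately (step 5), and `background` would be passed as the gene universe to the enrichment tool.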