Question

Pathway analysis on an intersection of gene lists

2

8.6 years ago

lcordeiro ▴ 40

I'm studying two cancers in the same tissue and wanted to identify common pathways in both diseases based on microarray and RNA-Seq expression data.

I could run separate enrichment and pathway analysis on each disease and simply observe which pathways are common.

My question is: is there a way to use the intersection of differentially expressed genes in each disease in any kind of enrichment analysis?

I tried running over-representation tests and GSEA on the intersection list, but since it is quite small (around 100 DEGs), none of the results were statistically significant. Perhaps I'm using the wrong background list? I've tried the background list obtained after applying typical non-specific filtering techniques. It works well for each individual disease but not with the intersection of GDEs.

Someone suggested that I use the union of GDEs as background but I couldn't find a strong rationale for that (or any good reference on this, for that matter.)
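To make the background question concrete, here is a minimal sketch of an over-representation test, assuming the one-sided hypergeometric test that such tools commonly use. The gene IDs, list sizes, and both backgrounds are made-up toy data, and `ora_pvalue` is a hypothetical helper, not a function from any package. It only shows that the same intersection list can give very different p-values depending on the background, not which background is correct.

```python
from math import comb

def ora_pvalue(gene_list, pathway, background):
    """Upper-tail hypergeometric p-value P(X >= k) for one pathway."""
    background = set(background)
    # Genes outside the background cannot be drawn, so drop them first.
    genes = set(gene_list) & background
    members = set(pathway) & background
    N, K, n = len(background), len(members), len(genes)
    k = len(genes & members)
    # Sum the hypergeometric probabilities for k, k+1, ... overlapping genes.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)

# Toy data with made-up gene IDs (assumptions, not from the thread).
pathway = {f"g{i}" for i in range(50)}
intersection = {f"g{i}" for i in range(10)} | {f"g{i}" for i in range(500, 590)}
filtered_background = {f"g{i}" for i in range(1000)}            # e.g. all genes passing filtering
union_background = pathway | {f"g{i}" for i in range(500, 650)}  # e.g. union of both DEG lists

p_filtered = ora_pvalue(intersection, pathway, filtered_background)
p_union = ora_pvalue(intersection, pathway, union_background)
print(p_filtered, p_union)
```

In this toy setup the union background is itself rich in pathway genes, so the pathway looks less surprising against it than against the larger filtered background.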

RNA-Seq microarray pathway analysis • 3.3k views

ADD COMMENT • link updated 8.6 years ago by ivivek_ngs ★ 5.2k • written 8.6 years ago by lcordeiro ▴ 40

score 0 · Answer 1 · 2016-04-21

0

8.6 years ago

ivivek_ngs ★ 5.2k

There are several caveats in what you are doing.

First, you need to know how many replicates you have on each platform. Microarray and RNA-Seq are two different platforms, and intersecting their gene lists when you have very few samples is not a good strategy. If you have a considerable number of samples on each platform, running the DE analysis and looking for the common genes might still yield something.

I would prefer to first find DEGs separately on each platform, using the same normalization methods for both. Instead of intersecting the genes, which leaves you with too few, you can simply run a pathway analysis on the DEGs from each platform and check whether the top significant pathways overlap. Alternatively, you can run GO enrichment to see whether the top GO terms correspond between the DEGs from the two platforms. If they do, you can overlap the GO terms and draw some conclusions.

There may be plenty of ways to do this, but ultimately it's all about the hypothesis you want to test or the message you want to convey.

Having said all this, I would apply some conditional filtering of the top DEGs from both platforms based on p-value and fold change (FC), and the genes that pass should then go through the analysis described above. Hope this sheds some light.
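The filtering step could look roughly like the sketch below. The cutoffs (adjusted p-value of 0.05, absolute log2 fold change of 1), the gene names, and the numbers are all illustrative assumptions, and `filter_degs` is a hypothetical helper rather than part of any DE package.

```python
def filter_degs(de_table, max_adj_p=0.05, min_abs_log2fc=1.0):
    """Return genes passing both the adjusted p-value and fold-change cutoffs."""
    return {
        gene
        for gene, (log2fc, adj_p) in de_table.items()
        if adj_p <= max_adj_p and abs(log2fc) >= min_abs_log2fc
    }

# Made-up DE results: gene -> (log2 fold change, adjusted p-value).
platform_a = {"GENE_A": (2.1, 0.001), "GENE_B": (0.3, 0.0004), "GENE_C": (-1.8, 0.02)}
platform_b = {"GENE_A": (1.6, 0.01), "GENE_C": (-1.2, 0.20), "GENE_D": (-2.5, 0.003)}

degs_a = filter_degs(platform_a)
degs_b = filter_degs(platform_b)
print(sorted(degs_a), sorted(degs_b), sorted(degs_a & degs_b))
```

Each filtered list can then go into its own pathway or GO analysis before the results are compared.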

ADD COMMENT • link 8.6 years ago by ivivek_ngs ★ 5.2k

0

Thanks for the quick reply. I'm not mixing these two types of data yet. That will certainly be another issue later on.

Sample size shouldn't be much of an issue as I have more than 100 samples in each group (disease 1, adjacent tissue 1, disease 2, adjacent tissue 2). I mean, sample size is ALWAYS an issue but 400 data points is not too bad for microarray data, I guess...

The point now is really to know (or to be more confident about) if there is a way to use the intersection DEG list directly in any kind of enrichment analysis and, if so, which background to use.

ADD REPLY • link 8.6 years ago by lcordeiro ▴ 40

0

In that case you have to take into account the background of both the microarray (probes mapped to genes) and the RNA-Seq (genes), build a matrix of the genes that are expressed on both platforms, then separate the data by platform and perform differential expression analysis (DEA) on each. That way both analyses share the same background of genes. You will obviously lose some genes, but with that many data points it should not be a concern.

In short: if you create an expression matrix from the microarray and map the probes to gene symbols, and do the same for RNA-Seq (mapping read counts to gene symbols), you will have m genes for the microarray and n genes for RNA-Seq. Overlap them to see how large the overlap is. If the overlap is high and significant, create an expression matrix restricted to the genes common to both sets (the intersection of the m and n genes) for both microarray and RNA-Seq, run DEA on each separately, and then proceed as described above.

In this case the background is the same for both platforms. You might lose some important genes, but your aim is to see the DEGs captured consistently by both platforms, so some kind of enrichment will be there. Whatever DEGs you get from each DEA, you can run GO enrichment or GSEA separately for each platform with the same background for both.
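A rough sketch of the probe-to-gene mapping and the overlap check follows. Averaging probes that map to the same gene is an assumption made here for simplicity (other collapsing rules exist), and the probe IDs, gene names, and values are invented; `collapse_probes` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

def collapse_probes(probe_matrix, probe_to_gene):
    """Average, per sample, all probes that map to the same gene symbol."""
    grouped = defaultdict(list)
    for probe, row in probe_matrix.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # unmapped probes are dropped
            grouped[gene].append(row)
    return {gene: [mean(col) for col in zip(*rows)] for gene, rows in grouped.items()}

# Made-up microarray data: probe -> expression values for two samples.
microarray = {"p1": [1.0, 3.0], "p2": [3.0, 5.0], "p3": [2.0, 2.0], "p4": [9.0, 9.0]}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
rnaseq_genes = {"GENE_A", "GENE_C"}  # gene symbols with counts in RNA-Seq

array_genes = collapse_probes(microarray, probe_to_gene)
common_genes = set(array_genes) & rnaseq_genes
overlap_fraction = len(common_genes) / min(len(array_genes), len(rnaseq_genes))
print(sorted(common_genes), overlap_fraction)
```

The `common_genes` set is what would serve as the shared background for both platforms.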

ADD REPLY • link 8.6 years ago by ivivek_ngs ★ 5.2k

0

Thanks again. I think I wasn't very clear in my first comment, sorry for that. At this moment, I'm not integrating microarray data with RNA-Seq data; I'm only analyzing microarray data for now.

To recap: I know how to run over-representation and GSEA analyses on each single dataset. That is fine. What I want to do is find pathways that are altered in both cancers; finding, for instance, that both cancers have an upregulated MAPK pathway.

I believe that we could do this in two ways: (1) doing functional annotation on each cancer individually and then comparing the altered pathways, and (2) comparing GDEs between the two conditions and THEN running functional annotation on the intersection of GDEs.
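Approach (1) can be sketched as below, assuming each enrichment run yields a normalized enrichment score (NES) and an FDR per pathway. The pathway names, scores, and the 0.05 cutoff are made-up illustrations, and `shared_pathways` is a hypothetical helper.

```python
def shared_pathways(enrich_1, enrich_2, max_fdr=0.05):
    """Pathways significant in both results with the same sign of enrichment score."""
    shared = {}
    for name in enrich_1.keys() & enrich_2.keys():
        (nes_1, fdr_1), (nes_2, fdr_2) = enrich_1[name], enrich_2[name]
        same_direction = (nes_1 > 0) == (nes_2 > 0)
        if fdr_1 <= max_fdr and fdr_2 <= max_fdr and same_direction:
            shared[name] = "up" if nes_1 > 0 else "down"
    return shared

# Made-up enrichment results: pathway -> (NES, FDR).
cancer_1 = {"MAPK signaling": (1.9, 0.01), "Cell cycle": (2.2, 0.001), "Apoptosis": (-1.7, 0.03)}
cancer_2 = {"MAPK signaling": (1.6, 0.04), "Cell cycle": (1.1, 0.30), "Apoptosis": (1.8, 0.02)}

print(shared_pathways(cancer_1, cancer_2))
```

Requiring the same direction in both cancers keeps only pathways altered the same way, such as a pathway upregulated in both.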

I would like to hear from you all whether the second approach makes any sense and, if so, which background list I should use.

Ciao,

Leonardo

ADD REPLY • link 8.6 years ago by lcordeiro ▴ 40

0

Ah, OK. I thought you were trying to do a kind of meta-analysis; it was not clear. Let me restate to be sure I understand: you are now analyzing microarray data from two different tumors, each as control vs. diseased (is that so?). Then, after normalization:

1) For the first tumor (control vs. disease) you will have an expression matrix that you can derive from limma (gene names as rows and expression values as columns for your samples, both control and diseased). Let's say m rows and n columns.

2) You will get the same thing for your second type of tumor: x rows (genes) and y columns (samples).

3) Now take the overlap of the m and x gene rows to see which genes are actually expressed in both diseases, fetch those gene IDs, and compute the expression matrix across all samples separately for each disease with its corresponding controls.

4) The genes you get in 3) are the gene IDs common to both tumor types, and they can be used as the background for GSEA.

5) I would in fact prefer to also run the differential expression analysis on the same background, so that I can also control for false positives in the DEGs: you use the same genes that are expressed in the 2 different cancers and apply DE analysis to them to get GDEs (as you say). In doing so you will have considerably fewer expressed genes as background in both tumors, but your GDEs will be specific to the background of genes common to both tumors. Once you perform the DE analysis and extract the GDE list for each tumor, you can use them for GSEA with the background that was common to both.
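Steps 3) and 4) can be sketched as follows, assuming each expression matrix is held as a mapping from gene ID to a list of sample values. The gene names and values are invented, and `restrict_to_common` is a hypothetical helper; the DE analysis itself (for example with limma) is not shown.

```python
def restrict_to_common(matrix_1, matrix_2):
    """Keep only the genes present in both matrices; samples are left untouched."""
    common = sorted(matrix_1.keys() & matrix_2.keys())
    sub_1 = {gene: matrix_1[gene] for gene in common}
    sub_2 = {gene: matrix_2[gene] for gene in common}
    return common, sub_1, sub_2

# Made-up matrices: m = 3 genes x n = 4 samples, and x = 3 genes x y = 3 samples.
tumor_1 = {"GENE_A": [5.1, 5.3, 7.9, 8.2], "GENE_B": [2.0, 2.1, 2.2, 1.9], "GENE_C": [6.0, 6.1, 4.0, 4.2]}
tumor_2 = {"GENE_A": [4.8, 7.5, 7.7], "GENE_C": [5.9, 3.8, 4.1], "GENE_D": [1.0, 1.2, 0.9]}

background, tumor_1_common, tumor_2_common = restrict_to_common(tumor_1, tumor_2)
print(background)
```

Both restricted matrices then go through DE analysis separately, and `background` is the gene universe passed to the enrichment step.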

Let me know whether this is clear.

ADD REPLY • link 8.6 years ago by ivivek_ngs ★ 5.2k