I have a list of genes that are differentially expressed between patients and controls. In papers, I have seen people analyze what fraction of the genes in a list are related to cancer, using tools such as "Ingenuity Pathway Analysis".
I have been unable to use that software in this way, though, and wanted to ask whether anyone knows of another way to accomplish this.
Thank you! I have used DAVID before, but could I input a set of human genes and determine the percentage that may be related to cancer in humans? I don't know of a way to do that in DAVID.
You can play with Functional Annotation there. Add your gene list as OFFICIAL_GENE_SYMBOL, then, for example, go to Pathways -> KEGG_PATHWAY in the "Annotation Summary Results" section and click Chart to see if any onco-pathways are enriched.
You can also do this in an unsupervised manner: click on "Functional Annotation Clustering" in the "Annotation Summary Results" section and the top-enriched annotation clusters will appear. Those are created using similarity between various annotation terms, e.g. the GO category cell cycle is somewhat associated with the KEGG "pathways in cancer" entry, etc. So, hopefully, among the top enriched clusters you'll get annotation categories related to oncogenesis.
This is pretty incredible. Thank you! I input 263 genes into DAVID and went to Pathways -> KEGG_PATHWAY. I got 15 genes listed: 8 were in "pathways in cancer", 4 in "prostate cancer", and 3 in "notch signaling pathway". I am not sure how to determine whether this is significant or just false discoveries, though, since 12 cancer-related genes out of 263 could easily be false positives. Indeed, the Benjamini values listed are large (0.99), but I don't know how they are calculated.
Yep, this looks more like false positives. You can try exploring other annotations. You can also try making your differential expression criteria more stringent. If the total number of genes decreases and those 12 genes remain, this could indicate that they're true positives :)
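On how a Benjamini value is calculated: here is a rough sketch of the idea, not DAVID's exact procedure. Each pathway gets an over-representation p-value (a plain hypergeometric tail below; DAVID reports an EASE score, a more conservative modified Fisher exact p-value, so its numbers will differ), and the p-values for all tested pathways are then adjusted with the Benjamini-Hochberg step-up rule. The background size, pathway size and p-value list in the example are hypothetical placeholders; only 263 and 8 come from the post above.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of seeing at least k pathway genes in a list of n
    genes drawn from a background of N genes, K of which are in the pathway."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # From the thread: 263 input genes, 8 in "pathways in cancer".
    # HYPOTHETICAL: background of 20000 genes, pathway of 330 genes.
    p_cancer = hypergeom_tail(k=8, n=263, K=330, N=20000)
    print("raw p-value:", p_cancer)

    # HYPOTHETICAL raw p-values for the other pathways tested.
    raw = [p_cancer, 0.20, 0.35, 0.50, 0.80]
    print("BH-adjusted:", benjamini_hochberg(raw))
```

The adjusted value for a pathway grows with the number of pathways tested, which is why a handful of modestly enriched pathways can end up with Benjamini values close to 1.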
Thanks for all your insight!!
Using the camera and (m)roast functions in the edgeR and limma packages (after applying voom) is another (principled) way to do GSEA-like analyses on RNA-seq data.
Thanks for the pointer! Could those be used if there are no biological replicates (only conditions) and a DE gene set can't be computed?
You don't need to perform a differential expression analysis before running these GSEA-like analyses; however, you do need to fit a linear model to your data and design, and you will (almost certainly) need some replication somewhere.
I am actually struggling with the input to GSEA for analyzing RNA-seq data. Could you please explain in detail how log2-transformed FPKM values (obtained from Cufflinks) are used as input to GSEA, as mentioned above?
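For the mechanics only, a minimal sketch, assuming you already have a genes-by-samples table of FPKM values: add a pseudocount, take log2, and write the result as a GCT (version 1.2) expression file, which is one of the formats GSEA reads. The gene names, sample names, FPKM values and the pseudocount of 1 below are all hypothetical, and the helper functions are made up for illustration. GSEA will also need a phenotype (.cls) file describing which sample belongs to which class, which is not shown here.

```python
import math


def fpkm_to_log2(fpkm_by_gene, pseudocount=1.0):
    """log2(FPKM + pseudocount) per gene; the pseudocount avoids log2(0)."""
    return {
        gene: [math.log2(value + pseudocount) for value in values]
        for gene, values in fpkm_by_gene.items()
    }


def to_gct(expr_by_gene, sample_names):
    """Format an expression table as GCT 1.2 text (tab-delimited)."""
    lines = [
        "#1.2",
        f"{len(expr_by_gene)}\t{len(sample_names)}",
        "\t".join(["NAME", "Description"] + list(sample_names)),
    ]
    for gene, values in expr_by_gene.items():
        # "na" is a placeholder for the Description column.
        lines.append("\t".join([gene, "na"] + [f"{v:.4f}" for v in values]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # HYPOTHETICAL FPKM values for two patients and two controls.
    samples = ["patient_1", "patient_2", "control_1", "control_2"]
    fpkm = {
        "GENE_A": [0.0, 1.0, 3.0, 7.0],
        "GENE_B": [15.0, 31.0, 0.5, 0.0],
    }
    print(to_gct(fpkm_to_log2(fpkm), samples), end="")
```

Whether log2 FPKM is the best input for your design is a separate question; this only shows what the file would look like.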