Hi all,
I ran GSEA pre-ranked on genes differentially expressed between tumor and normal samples. The genes are ordered by log fold change, since GSEA pre-ranked requires a ranked list of genes.
The results are reported with the phenotypes labeled na_pos and na_neg. I am not sure what these phenotypes mean: how does GSEA distinguish two phenotypes from a single ordered list of genes?
I understand that standard GSEA, when run on tumor and normal expression values, reports results for those two phenotypes, but I am not sure what the phenotype means in GSEA pre-ranked.
In my case, the input to GSEA pre-ranked is the gene symbols together with their fold change values, which are both positive and negative.
-Ron
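The direction of gene expression is critical. It is very likely that you have different gene sets enriched in either direction. In pre-ranked mode, na_pos reports gene sets enriched at the top of your ranked list (positive scores) and na_neg reports those enriched at the bottom (negative scores).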
Ranking by significance is better than ranking by fold change; just think about lowly expressed genes with extreme fold changes and high p-values. Check out this NAR paper, which discusses the difference.
http://nar.oxfordjournals.org/content/38/17/e169.long
You can generate a rank file with a simple awk script.
http://genomespot.blogspot.com.au/2015/01/how-to-generate-rank-file-from-gene.html
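If you would rather not use awk, here is a minimal Python sketch of one common way to build the ranking: score each gene as sign(logFC) * -log10(p-value), so that direction is kept and significance sets the magnitude. The function names, the example genes, and the `min_p` floor are my own assumptions for illustration, not something taken from the linked post.

```python
import math

def rank_score(log_fc, p_value, min_p=1e-300):
    """Signed significance score: sign(logFC) * -log10(p-value)."""
    p = max(p_value, min_p)  # floor the p-value so log10(0) cannot occur
    sign = 1.0 if log_fc > 0 else -1.0 if log_fc < 0 else 0.0
    return sign * -math.log10(p)

def build_rank_lines(rows):
    """rows: iterable of (gene, log_fc, p_value) tuples.

    Returns tab-separated "gene<TAB>score" lines, sorted from the most
    positive score to the most negative.
    """
    scored = [(gene, rank_score(log_fc, p)) for gene, log_fc, p in rows]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [f"{gene}\t{score:.4f}" for gene, score in scored]

# Made-up values, only to show the behaviour.
example_rows = [
    ("GENE_A", 2.0, 0.001),    # up, significant
    ("GENE_B", -1.5, 0.0001),  # down, very significant
    ("GENE_C", 5.0, 0.5),      # huge fold change, not significant
]
rank_lines = build_rank_lines(example_rows)
print("\n".join(rank_lines))
```

Note that GENE_C, despite having the largest fold change, ends up in the middle of the list because its p-value is high. Writing `rank_lines` to a file with a .rnk extension should give you something GSEA pre-ranked can load.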
I am wondering if ranking genes by p-values gives rise to another nasty bias: genes with higher read counts (either because they are large or more highly expressed) yield lower p-values, simply because any statistical test will have more power with a larger number of reads. This would lead to artificially high enrichments of gene sets containing either large or highly expressed genes (or both). From my own GSEA results using p-value pre-ranked gene lists, I think that I indeed observe this trend, although I do not have hard data yet.
Thus, both solutions -- ranking by fold change and by p-value -- are probably not perfect. Any suggestions to do better?
I agree that neither option is ideal, but significance-based ranking at least has some statistical basis. Fold change is too susceptible to noise for lowly expressed genes. Another approach could be to rank by the lower bound of the confidence interval of the fold change. These all need to be benchmarked against each other, IMO.
That sounds worth a try. How would you compute a CI for RNA-seq fold changes? I have not seen them in the output of e.g. edgeR or DESeq2.
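One rough possibility, if your tool reports a standard error for the log fold change (the DESeq2 results table has an lfcSE column next to log2FoldChange): build an approximate interval as logFC ± z * SE and rank by the bound closest to zero. The sketch below assumes a normal approximation with z = 1.96 for a ~95% interval; the function name and example values are hypothetical, and this is just one way to read "lower confidence bound", not an established method.

```python
def conservative_lfc(log_fc, lfc_se, z=1.96):
    """Return the CI bound of the log fold change that is closest to zero.

    If the approximate interval [log_fc - z*se, log_fc + z*se] spans zero,
    the score is 0.0, so noisy genes drop to the middle of the ranking.
    """
    lower = log_fc - z * lfc_se
    upper = log_fc + z * lfc_se
    if lower > 0:
        return lower  # confidently up: keep the smallest plausible increase
    if upper < 0:
        return upper  # confidently down: keep the smallest plausible decrease
    return 0.0        # interval includes zero

# Made-up (gene, log2 fold change, standard error) values.
example = [
    ("GENE_A", 2.0, 0.5),
    ("GENE_B", -3.0, 0.5),
    ("GENE_C", 6.0, 4.0),  # extreme fold change but very noisy
]
ranked = sorted(
    ((gene, conservative_lfc(lfc, se)) for gene, lfc, se in example),
    key=lambda item: item[1],
    reverse=True,
)
for gene, score in ranked:
    print(f"{gene}\t{score:.2f}")
```

One caveat: every gene whose interval spans zero gets the same score of 0, so the list will contain many ties, and I do not know how well GSEA pre-ranked handles that.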
Some thoughts from Gordon Smyth on the issue. Very useful.
https://support.bioconductor.org/p/61640/