The Gene Set Enrichment Analysis (GSEA) algorithm, outlined in this paper, http://www.broadinstitute.org/gsea/doc/subramanian_tamayo_gsea_pnas.pdf, often refers to a "random walk" used to traverse the ranked list L of gene-to-phenotype correlations.
However, what they actually do in the paper does not look like a random walk at all. It seems to me that they traverse the ranked list L sequentially, from rank 1 (highest correlation) onwards.
Could anyone clear up what they mean by "random walk", and why they use the term when what they do looks like a sequential walk, which is quite the opposite?
Also, as a follow-up question, how is it that they do not bias the top of the ranked list L over the bottom? If we assume for the moment that they are doing a sequential walk, which seems to be the case, then the gene sets found at the bottom extreme will have a larger value for P_miss, since P_miss is proportional to i. As a consequence, they will have smaller enrichment scores.
Perhaps this is related to the question above, since a sequential walk does not seem to work here...
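For reference, here are the running sums as I read them in the paper, written out in LaTeX. The symbols are the paper's as far as I can tell: N is the number of genes in L, N_H the number of genes in the set S, r_j the correlation of the gene at rank j, and p the weighting exponent. Treat this as my transcription rather than an authoritative statement.

```latex
P_{\text{hit}}(S, i)  = \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^{p}}{N_R},
\qquad N_R = \sum_{g_j \in S} |r_j|^{p}

P_{\text{miss}}(S, i) = \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H}

% ES(S) is the value of P_hit - P_miss with the largest magnitude over i = 1..N
ES(S) = P_{\text{hit}}(S, i^{*}) - P_{\text{miss}}(S, i^{*}),
\qquad i^{*} = \arg\max_{i} \left| P_{\text{hit}}(S, i) - P_{\text{miss}}(S, i) \right|
```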
I appreciate any help... I suspect I am not understanding something correctly...
Hey, thanks! I think this article made it clear. They are comparing the supremum (ES) with what it would be for a random walk... gene sets found at the top or the bottom will have a higher ES, and gene sets that are randomly distributed will resemble a random walk - thanks!
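For anyone who lands here later, this is a minimal sketch of the running sum that convinced me. It is not the Broad implementation: it assumes the unweighted case (every hit counts equally, i.e. p = 0), and the names `running_sum` and `enrichment_score` and the toy gene lists are made up for illustration.

```python
import random

def running_sum(ranked_genes, gene_set):
    """Return P_hit(i) - P_miss(i) for i = 1..N, walking the list in rank order.

    Assumption: unweighted steps (p = 0), so each hit adds 1/N_H
    and each miss subtracts 1/(N - N_H).
    """
    n = len(ranked_genes)
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    hit_step = 1.0 / n_hits
    miss_step = 1.0 / (n - n_hits)
    total, walk = 0.0, []
    for g in ranked_genes:
        total += hit_step if g in gene_set else -miss_step
        walk.append(total)
    return walk

def enrichment_score(walk):
    """ES is the deviation from zero with the largest magnitude (sign kept)."""
    return max(walk, key=abs)

# Hypothetical toy data: 1000 ranked genes, three gene sets of size 20.
genes = list(range(1000))
top_set = set(range(20))             # clustered at the top of L
bottom_set = set(range(980, 1000))   # clustered at the bottom of L
random.seed(0)
scattered_set = set(random.sample(genes, 20))  # spread at random through L

es_top = enrichment_score(running_sum(genes, top_set))
es_bottom = enrichment_score(running_sum(genes, bottom_set))
es_scattered = enrichment_score(running_sum(genes, scattered_set))
final_value = running_sum(genes, bottom_set)[-1]

print("top:", es_top, "bottom:", es_bottom, "scattered:", es_scattered)
```

The traversal itself is sequential. The "random walk" is how the running sum behaves when the set members are scattered at random through L. Since both P_hit and P_miss reach 1 at the end of the list, the sum always returns to zero, so a set clustered at the bottom gets a large negative ES rather than a small one.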