The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic (ref. 7 and Fig. 1B). http://www.pnas.org/content/102/43/15545
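To make the quoted description concrete, here is a minimal sketch of such a running sum, assuming a pre-ranked gene list with one correlation value per gene. The function name `enrichment_score`, the exponent `p` (set to 1 here), and the toy genes and values are all hypothetical and only meant for illustration. Hits step up by an amount proportional to |correlation|^p, misses step down by a constant amount, and the score is the largest deviation from zero.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Running-sum enrichment score over a ranked gene list (illustrative sketch).

    Assumes the gene set contains at least one hit and one miss in the list.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    # Total weight of all hits, used to normalise the upward steps
    hit_weight = sum(abs(r) ** p for r, hit in zip(correlations, in_set) if hit)

    running, best = 0.0, 0.0
    for r, hit in zip(correlations, in_set):
        if hit:
            running += abs(r) ** p / hit_weight  # step size depends on the correlation
        else:
            running -= 1.0 / n_miss              # constant step for genes outside the set
        if abs(running) > abs(best):
            best = running                       # keep the maximum deviation from zero
    return best


# Hypothetical toy data: genes already sorted by correlation with the phenotype
ranked_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
correlations = [0.9, 0.7, 0.2, -0.1, -0.5, -0.8]
gene_set = {"g1", "g2", "g5"}

es = enrichment_score(ranked_genes, correlations, gene_set)
print(es)
```

With these toy values the walk peaks after the second gene, because both top-ranked genes are in the set and carry most of the hit weight.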
How does one calculate the correlation between gene expression (continuous values) and categorical phenotype data (strings or binary encoded data)?
For example, suppose one had the following data:
gene_expr_A = [0.3, 0.5, 0.8, 13.0, 12.3, 15.8]
phenotypes = ["healthy", "healthy", "healthy", "diseased", "diseased", "diseased"]
Would this correlation be calculated like this?
phenotypes_encoded = [0,0,0,1,1,1]
correlation = pearson(gene_expr_A, phenotypes_encoded)
Is this statistically robust? I feel like this oversimplifies the operations.
Hey Kevin, thanks, this made a lot of sense once I plotted the data. I don't know why, but it always seemed incorrect to look at correlations in this way. You're right, though, and I definitely get it now.
Also, your linear model at the end is interesting. I'm pretty new to linear models used in this way so please let me know if I understand this correctly.
y = beta*x + bias_constant + epsilon
where y is the gene expression value from continuous, beta is the coefficient multiplied against x, which is either 0 or 1 depending on the phenotype, bias_constant is the y-intercept, and epsilon is some normally distributed error.
Is the fit of the model, as measured by R^2, the Pearson correlation between the 2 vectors, squared? I've seen R^2 values between -1 and 1. How would the negative R^2 values be computed in this way?
Thanks again.
The negative r-squared is explained very well here: https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative
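As a small illustration of how that can happen, here is a sketch that assumes R^2 is computed as 1 - SS_res / SS_tot. The expression values are hypothetical, and the function name `r_squared` is just for this example. A model forced through the origin (no intercept) predicts 0 for every healthy sample, which fits worse than simply predicting the overall mean, so R^2 drops below zero.

```python
def r_squared(observed, predicted):
    """R^2 as 1 - SS_res / SS_tot; negative when the model is worse than the mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: expression barely differs between the two groups
expr = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0]
phenotype = [0, 0, 0, 1, 1, 1]

# Least-squares slope for y = beta * x with NO intercept term
beta = sum(x * y for x, y in zip(phenotype, expr)) / sum(x * x for x in phenotype)
pred_no_intercept = [beta * x for x in phenotype]
r2_no_intercept = r_squared(expr, pred_no_intercept)

# Baseline: predict the overall mean for every sample, which gives R^2 = 0
mean_expr = sum(expr) / len(expr)
r2_mean_only = r_squared(expr, [mean_expr] * len(expr))

print(r2_no_intercept, r2_mean_only)
```

So a negative value is not a squared correlation at all. It only says that the residual sum of squares is larger than the total sum of squares around the mean.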
For your other questions, I also point you to other material:
[Biostars is more for general bioinformatics, not statistics]
Remember, of course, that cor() and lm() will only produce the same value in a select few cases.
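One case where the two do line up is a simple regression with a single predictor and an intercept, where R^2 equals the squared Pearson correlation. Below is a minimal sketch in plain Python using the data from the question. The helpers `pearson` and `fit_line` are hypothetical stand-ins written for this example, not the R functions cor() and lm(). Note that the slope itself is not the correlation: with a 0/1 predictor it is the difference between the two group means.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


gene_expr_A = [0.3, 0.5, 0.8, 13.0, 12.3, 15.8]
phenotypes_encoded = [0, 0, 0, 1, 1, 1]

r = pearson(phenotypes_encoded, gene_expr_A)
slope, intercept = fit_line(phenotypes_encoded, gene_expr_A)

# R^2 of the fitted line, computed as 1 - SS_res / SS_tot
predicted = [slope * x + intercept for x in phenotypes_encoded]
mean_y = sum(gene_expr_A) / len(gene_expr_A)
ss_res = sum((y - p) ** 2 for y, p in zip(gene_expr_A, predicted))
ss_tot = sum((y - mean_y) ** 2 for y in gene_expr_A)
r2 = 1.0 - ss_res / ss_tot

print(r, r * r, r2)       # r squared matches R^2 for this model
print(slope, intercept)   # slope = diseased mean - healthy mean, intercept = healthy mean
```

Once the model has more than one predictor, or drops the intercept, this equality no longer holds.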
I will keep that in mind for the future. Thanks for the help even though this was out of scope. These links are really useful.