In GSEA, how can one "correlate" gene expression with categorical phenotype data?
1
0
Entering edit mode
6.2 years ago
O.rka ▴ 740

The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic (ref. 7 and Fig. 1B). http://www.pnas.org/content/102/43/15545

How does one calculate the correlation between gene expression (continuous values) and categorical phenotype data (strings or binary encoded data)?

For example, suppose one had the following data:

gene_expr_A = [0.3, 0.5, 0.8, 13.0, 12.3, 15.8]
phenotypes = ["healthy", "healthy", "healthy", "diseased", "diseased", "diseased"]

Would this correlation be calculated like this?

phenotypes_encoded = [0,0,0,1,1,1]
correlation = pearson(gene_expr_A, phenotypes_encoded)

Is this statistically robust? I feel like this oversimplifies the operations.

gsea correlation pearson genes expression • 3.0k views
ADD COMMENT
3
Entering edit mode
6.2 years ago

Hey,

Most things are simple when broken down and understood. I think that it is reasonable to correlate a continuous variable to the numerical equivalent of a categorical variable. This is what happens with a linear regression, after all. Take a look:

continuous <- c(45, 67, 12, 65, 75, 3, 44, 90)
categorical <- factor(c(0,0,0,0,1,1,1,1))

cor(continuous, as.numeric(categorical)) ^ 2
[1] 0.01024737

summary(lm(continuous ~ categorical))$r.squared
[1] 0.01024737

Between both of these, I would feel more comfortable reporting the regression values along with the p-value from the regression fit. Be careful on the interpretation of what this 'correlation' means, too.

Kevin

ADD COMMENT
0
Entering edit mode

Hey Kevein, thanks this is making a lot of sense once I plotted data. I don't know why but it always seemed incorrect to look at correlations in this way but you're right and I definitely get it now.

Also, your linear model at the end is interesting. I'm pretty new to linear models used in this way so please let me know if I understand this correctly.

y = beta*x + bias_constant + epsilon

where y is the gene expression value from continuous, beta is the coefficient multiplied against x which is either 0 or 1 depending on the phenotype, bias_constant is the y intercept, and episilon is some normally distributed error. The fit for the model measured by R^2 is the pearson correlation between the 2 vectors squared?

I've seen R^2 that are between -1 and 1. How would the negative R^2 values be computed in this way?

Thanks again.

ADD REPLY
1
Entering edit mode

The negative r-squared is explained very well here: https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative

For your other questions, I also point you to other material:

[Biostars is more for general bioinformatics, not statistics]

Remember, of course, that cor() and lm() will only produce the same value in a select few cases.

ADD REPLY
1
Entering edit mode

I will keep that in mind for the future. Thanks for the help even though this was out of scope. These links are really useful.

ADD REPLY

Login before adding your answer.

Traffic: 2730 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6