My project is working on a large dataset of RPKM values for patients with and without Schizophrenia.
After some preprocessing steps including dumping genes with lots of zero RPKM values and log2-transforming, I have applied Non-negative Matrix Factorization (NNMF) as a dimensionality reduction technique. I am looking for statistically significant correlations between the resulting groups of genes ('metagenes') and schizophrenia.
Until now, I have been using a simple t-test, with Bonferroni correction, to test the metagene expression values for correlation with Schizophrenia. I think that the normality condition is fine because there are about 150 cases and 170 controls - so CLT holds. Some of the results have such very low adjusted p-values that I am relatively certain I have found something interesting.
However, I need to be absolutely sure that this is not down to confounding factors. There are slight demographic imbalances between the schizophrenia and non-schizophrenia groups, so I need to correct for a few variables, both discrete and continuous. The full list I want to correct for is: Age, Sex, Race, Smoking or not, Postmortem interval, sample pH, and RNA integrity number.
Is there a statistical test, more advanced than the t-test, that will ACCOUNT for the impact of these confounding covariates and make sure that I really have found statistically significant correlations with Schizophrenia? If there is not, can you recommend how I could change my procedure to best guard against the confounding factors?
Want to make sure I'm reporting solid results! Thanks for any help you can give.
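For readers following along: one common way to account for covariates like those listed above is to put them in a regression model alongside the variable of interest and read off the adjusted term. The following is only a minimal sketch in base R, not code from any answer in this thread. The data are simulated, and every object and column name (pheno, metagenes, Dx, PMI, RIN, and so on) is a hypothetical stand-in for the real sample information.

```r
set.seed(1)
n <- 320  # roughly 170 controls + 150 cases, as described in the question

# Simulated sample information (all column names are hypothetical)
pheno <- data.frame(
  Dx      = factor(rep(c("Control", "SCZ"), times = c(170, 150)),
                   levels = c("Control", "SCZ")),
  Age     = rnorm(n, mean = 45, sd = 12),
  Sex     = factor(sample(c("F", "M"), n, replace = TRUE)),
  Race    = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  Smoking = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  PMI     = rexp(n, rate = 1 / 20),
  pH      = rnorm(n, mean = 6.5, sd = 0.3),
  RIN     = rnorm(n, mean = 7.5, sd = 1)
)

# Simulated metagene scores: samples in rows, metagenes in columns
metagenes <- matrix(rnorm(n * 5), nrow = n,
                    dimnames = list(NULL, paste0("MG", 1:5)))
# Make MG1 truly associated with diagnosis so there is something to find
metagenes[, "MG1"] <- metagenes[, "MG1"] + 1.0 * (pheno$Dx == "SCZ")

# Fit one covariate-adjusted logistic regression per metagene
fit_one <- function(mg) {
  dat <- cbind(pheno, score = metagenes[, mg])
  fit <- glm(Dx ~ score + Age + Sex + Race + Smoking + PMI + pH + RIN,
             data = dat, family = binomial(link = "logit"))
  coefs <- summary(fit)$coefficients
  data.frame(metagene = mg,
             log_odds = coefs["score", "Estimate"],
             p_value  = coefs["score", "Pr(>|z|)"])
}

results <- do.call(rbind, lapply(colnames(metagenes), fit_one))

# Bonferroni correction across the metagenes tested
results$p_bonferroni <- p.adjust(results$p_value, method = "bonferroni")
results
```

The p-value on the score term is then the association with diagnosis after adjusting for the other terms in the model. Whether a logistic model (diagnosis as outcome) or a linear model (metagene score as outcome) is the better fit for a given dataset is a judgment call that this sketch does not settle.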
Hi Kevin, thanks for your answer. What if either of these turns out to be statistically significant in my sample information?
I do not understand what you mean
I am sorry. I mean: if two variables are confounded with each other, how should I deal with that?
How would you write the loop for the logistic regression model if, instead of "Schizophrenia" (i.e. the treatment condition), you want to see how each gene is affected by age and sex independently of the health status of the patient (and so, basically, identify the residual)?
Hi, I think that the same idea applies but that you just use a different formula?
The above method is inefficient due to the fact that just a for loop is being used. You may get what you need via my other Bioconductor package, RegParallel: https://bioconductor.org/packages/release/data/experiment/vignettes/RegParallel/inst/doc/RegParallel.html#perform-a-basic-linear-regression
Does it work only on normalised raw counts (DESeq2), or can it also be used on RPKMs?
These models should preferably be run on, e.g., the regularised log or variance-stabilised expression levels, if using DESeq2. If you have RPKM, I would log2(RPKM + 0.1) these.
As I understand it, you do indeed just want a formula of the form:
..or:
So, just a linear regression via lm(). To re-use the above code, that would be something like:
Quick example
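The original example code is not reproduced in this part of the thread. As a rough stand-in, here is a minimal sketch of the idea with base R's lm(): log2(RPKM + 0.1) expression for each gene is regressed on age and sex, and the residuals are kept. The data are simulated, and the names pheno, rpkm, Age and Sex are assumptions for illustration only, not the objects used earlier in the thread.

```r
set.seed(42)
n_samples <- 40
n_genes   <- 3

# Hypothetical sample information
pheno <- data.frame(
  Age = round(runif(n_samples, 20, 80)),
  Sex = factor(sample(c("F", "M"), n_samples, replace = TRUE))
)

# Simulated RPKM matrix: genes in rows, samples in columns
rpkm <- matrix(rexp(n_genes * n_samples, rate = 0.1),
               nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes), NULL))

# Transform as suggested above
expr <- log2(rpkm + 0.1)

# One linear model per gene: expression explained by age and sex
fits <- lapply(rownames(expr), function(g) {
  dat <- cbind(pheno, expression = expr[g, ])
  lm(expression ~ Age + Sex, data = dat)
})
names(fits) <- rownames(expr)

# Coefficient table (estimate, standard error, t, p) for a single gene
summary(fits[["gene1"]])$coefficients

# Residuals: what is left of each gene's expression after removing the
# fitted age and sex effects (genes in rows, samples in columns)
resid_mat <- t(sapply(fits, residuals))
dim(resid_mat)
```

For thousands of genes the lapply() call does the same job as a for loop and is no faster, which is where a parallelised approach such as the RegParallel package mentioned above may help.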