Hi all,
I'm looking for tools to assess how much of the variation in a dataset (e.g. gene expression data) is explained by each covariate, continuous or categorical, so as to screen which variables one may want to adjust for when testing in a linear model framework (limma, DESeq2, etc.).
For example, I have often used the pcrplot function of the ENmix R package, which correlates variables to principal components and gives a useful plot.
(And of course there is always visual screening of the PCA plots, coloring the samples by each variable.)
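To make the "correlate covariates with principal components" idea concrete, here is a minimal sketch of one way to put continuous and categorical covariates on the same scale. It assumes the PC scores per sample have already been computed elsewhere, and it is not how pcrplot itself is implemented: numeric covariates get a squared Pearson correlation, and everything else is treated as categorical and gets eta-squared (the R^2 of a one-way ANOVA). The function name r_squared and the toy values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def r_squared(scores, covariate):
    """Fraction of variance in `scores` explained by a single covariate.

    Numeric covariates: squared Pearson correlation.
    Anything else: treated as categorical, eta-squared
    (between-group sum of squares / total sum of squares).
    """
    grand = mean(scores)
    ss_total = sum((s - grand) ** 2 for s in scores)
    if ss_total == 0:
        return 0.0
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in covariate
    )
    if numeric:
        cov_mean = mean(covariate)
        sxy = sum((x - cov_mean) * (s - grand) for x, s in zip(covariate, scores))
        sxx = sum((x - cov_mean) ** 2 for x in covariate)
        return 0.0 if sxx == 0 else sxy ** 2 / (sxx * ss_total)
    groups = defaultdict(list)
    for level, s in zip(covariate, scores):
        groups[level].append(s)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return ss_between / ss_total


# Hypothetical toy data: scores of 6 samples on the first two PCs.
pcs = {
    "PC1": [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1],
    "PC2": [-1.0, -0.2, 0.9, -0.8, 0.1, 1.0],
}
covariates = {
    "batch": ["a", "a", "a", "b", "b", "b"],
    "age": [30, 41, 52, 33, 44, 55],
    "sex": ["F", "M", "F", "M", "F", "M"],
}

# One R^2 per (covariate, PC) pair; this is the table a heatmap would show.
table = {
    (cov, pc): r_squared(scores, values)
    for cov, values in covariates.items()
    for pc, scores in pcs.items()
}
for (cov, pc), r2 in table.items():
    print(f"{cov:>5} vs {pc}: R^2 = {r2:.2f}")
```

In this toy example batch lines up with PC1 and age with PC2. With real data one would probably also weight each PC by the proportion of variance it carries, since a covariate tied to PC8 matters much less than one tied to PC1.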
But I'm wondering if anyone knows of more sophisticated methods, or methods from which one can extract more "objective" stats to justify subsequent inclusion/exclusion of variables in the models. For example I've seen the R package pvca but it only works with categorical covariates.
Or else, what is your usual process when you want to do differential testing with linear models and have a lot of associated phenotypic variables?
Thanks!
Search for feature selection in machine learning. Some approaches, such as lasso or tree-based methods (e.g. random forest, xgboost), output a measure of variable importance. Another popular approach is recursive feature elimination.
Thanks! I thought feature selection methods were generally applied to choose or collapse the features that make up the data itself (e.g. genes), so my issue may be a bit different. I have two data frames/matrices: "A", with the features of interest (the gene measurements), and "B", with other, varied covariates (e.g. age, sex, batch...). The goal is to perform differential testing on "A", not to build a predictive model. I could combine "A" and "B" and run feature selection across everything, but that would only make sense if my goal were a predictive model using some genes plus the covariates that best separate some groups; I just want to test the genes between conditions.
(Although I have little experience in ML and may have misunderstood your suggestions.)
I am not sure what you're trying to achieve. Since you mention linear models, you could also compute different models and select one based on an information criterion. This previous post may also be of interest.
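As a rough illustration of the information-criterion idea, here is a small sketch that fits a few candidate designs to a single gene by ordinary least squares and compares them by AIC. Everything in it is an assumption made for the example: the data are invented, the helper names (solve, aic) are hypothetical, and the AIC is the Gaussian one up to an additive constant. For real expression data you would do this per gene and summarise across genes.

```python
import math


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def aic(y, columns):
    """Gaussian AIC (up to a constant) of an OLS fit with an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[j] * r[k] for r in design) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    rss = sum(
        (yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(design, y)
    )
    return n * math.log(rss / n) + 2 * (p + 1)  # +1 for the residual variance


# Invented data for one gene: driven by condition and batch, not by age.
condition = [0, 0, 0, 0, 1, 1, 1, 1]
batch = [0, 1, 0, 1, 0, 1, 0, 1]
age = [30, 45, 52, 38, 41, 36, 55, 47]
expression = [5.1, 6.9, 4.95, 7.05, 6.08, 7.92, 5.98, 8.02]

candidates = {
    "condition": [condition],
    "condition + batch": [condition, batch],
    "condition + age": [condition, age],
}
scores = {name: aic(expression, cols) for name, cols in candidates.items()}
best = min(scores, key=scores.get)

for name, value in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:<18} AIC = {value:.2f}")
print("lowest AIC:", best)
```

Here adding batch lowers the AIC a lot, while adding age makes it slightly worse than the condition-only model, because the small drop in residual variance does not pay for the extra parameter. Lower is better, and only differences between models fitted to the same data are meaningful.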
OK, I'll start from there, thanks!