Question

Tools for screening the influence/importance of covariates in multidimensional data


4.1 years ago

Papyrus ★ 3.1k

Hi all,

I'm looking for tools which can be used to check the importance of covariates (either continuous or categorical) in explaining information in data (e.g. gene expression data), so as to screen which variables one may want to adjust for when testing in a linear model framework (in limma, DESeq2, etc.).

For example, I have often used the pcrplot of the ENmix R package, which correlates variables to principal components and gives this useful plot:

[Figure: pcrplot output, showing the association of each covariate with the principal components]

(And of course there is always visual screening of PCA plots colored by each variable.)
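A minimal base R sketch of this kind of PC-to-covariate screening, not the ENmix implementation. The data are simulated and the names `expr`, `covs`, and `pc_cov_p` are hypothetical; the idea is simply to regress each principal component on each covariate and keep the F-test p-value, which works for both continuous and categorical variables.

```r
# Simulated stand-ins (assumptions): 'expr' is genes x samples,
# 'covs' holds one row of covariates per sample
set.seed(1)
n <- 40
covs <- data.frame(
  age   = rnorm(n, 50, 10),
  sex   = factor(sample(c("F", "M"), n, replace = TRUE)),
  batch = factor(rep(c("b1", "b2"), each = n / 2))
)
expr <- matrix(rnorm(500 * n), nrow = 500)
# Inject a batch effect into the first 100 genes
is_b2 <- covs$batch == "b2"
expr[1:100, is_b2] <- expr[1:100, is_b2] + 2

# PCA with samples as rows
pca  <- prcomp(t(expr), center = TRUE, scale. = FALSE)
n_pc <- 5

# Regress each PC on each covariate and keep the overall F-test p-value;
# lm() treats numeric columns as continuous and factors as categorical
pc_cov_p <- sapply(names(covs), function(v) {
  sapply(seq_len(n_pc), function(k) {
    fit <- lm(pca$x[, k] ~ covs[[v]])
    f   <- summary(fit)$fstatistic
    unname(pf(f[1], f[2], f[3], lower.tail = FALSE))
  })
})
rownames(pc_cov_p) <- paste0("PC", seq_len(n_pc))

# Rows are PCs, columns are covariates; small p-values flag covariates
# associated with major axes of variation
signif(pc_cov_p, 2)
```

Pairing these p-values with the proportion of variance each PC explains (`summary(pca)$importance`) helps judge whether an association sits on a major or a minor axis of variation.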

But I'm wondering if anyone knows of more sophisticated methods, or methods from which one can extract more "objective" stats to justify subsequent inclusion/exclusion of variables in the models. For example I've seen the R package pvca but it only works with categorical covariates.

Or else, what is your usual process when you want to do differential testing with linear models and have a lot of associated phenotypic variables?

Thanks!

R PCA confounding batch regression • 2.4k views

ADD COMMENT • link 3.9 years ago by Papyrus ★ 3.1k


Search for feature selection in machine learning. Some approaches, such as lasso or tree-based methods (e.g. random forest, xgboost), output a variable importance. Another popular approach is recursive feature elimination.

ADD REPLY • link 4.1 years ago by Jean-Karim Heriche 27k


Thanks! I thought feature selection methods were generally applied to choose/collapse the features that make up the information in the data (e.g. genes), so my issue may be a bit different. I have two dataframes/matrices: one, "A", with the data/features (the gene measurements), and another, "B", with other, varied covariates (e.g. age, sex, batch...). The goal is to perform differential testing (not even building predictive models) on the "A" data frame, which contains the features "of interest". I could combine the "A" and "B" dataframes and perform feature selection across everything, but that would make sense if my goal were to build a "predictive" model from the genes plus other covariates that best separate some groups. I just want to use the genes for testing between conditions.

(although I have little experience in ML and may have misunderstood your suggestions)

ADD REPLY • link 4.1 years ago by Papyrus ★ 3.1k


I am not sure what you're trying to achieve. Since you mention linear models, you could also compute different models and select one based on an information criterion. This previous post may also be of interest.
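A minimal sketch of the information-criterion idea for a single gene, in base R. The data are simulated, and the data frame `pheno`, the response `y`, and the candidate formulas are all hypothetical; lower AIC or BIC indicates a better trade-off between fit and number of parameters.

```r
# Simulated sample annotation and one gene's expression (assumptions)
set.seed(2)
n <- 40
pheno <- data.frame(
  group = factor(rep(c("ctrl", "case"), times = n / 2)),
  age   = rnorm(n, 50, 10),
  batch = factor(rep(c("b1", "b2"), each = n / 2))
)
# The gene depends on group and batch, but not on age
pheno$y <- 1 * (pheno$group == "case") + 2 * (pheno$batch == "b2") + rnorm(n)

# Candidate models: always keep the variable of interest, vary the adjustments
candidates <- list(
  group_only      = y ~ group,
  group_age       = y ~ group + age,
  group_batch     = y ~ group + batch,
  group_age_batch = y ~ group + age + batch
)
fits <- lapply(candidates, function(f) lm(f, data = pheno))

# Collect AIC and BIC for every candidate and sort by AIC
ic <- data.frame(AIC = sapply(fits, AIC), BIC = sapply(fits, BIC))
ic[order(ic$AIC), ]
```

For expression data this would be repeated per gene and the winning model tallied across genes. limma's selectModel function appears to do that kind of comparison across a list of design matrices, but check its documentation before relying on it.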

ADD REPLY • link 4.1 years ago by Jean-Karim Heriche 27k


OK, I'll start from there, thanks!

ADD REPLY • link 4.1 years ago by Papyrus ★ 3.1k

score 2 · Answer 1 · 2021-07-08


3.9 years ago

Papyrus ★ 3.1k

I'm updating this because I came upon a nice R/Bioconductor package specifically dedicated to this issue: variancePartition. It is designed to facilitate exploring how the covariates in an experiment explain variation in the data.
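A short sketch of a typical variancePartition call, assuming the package is installed and using the example data it ships with (`varPartData`, which provides `geneExpr` and `info`). The formula follows the package's convention of modeling categorical variables as random effects and continuous ones as fixed effects; restricting to the first 50 genes is only to keep the example quick.

```r
library(variancePartition)

# Example data shipped with the package:
# 'geneExpr' (genes x samples) and 'info' (sample metadata)
data(varPartData)

# Continuous covariate (Age) as a fixed effect,
# categorical covariates as random effects
form <- ~ Age + (1 | Individual) + (1 | Tissue) + (1 | Batch)

# Fit one mixed model per gene (subset of genes for speed)
varPart <- fitExtractVarPartModel(geneExpr[1:50, ], form, info)

# One row per gene: fraction of variance attributed to each term,
# with the unexplained remainder in 'Residuals'
vp <- sortCols(varPart)
head(vp)

# Violin plot summarizing the fractions across genes
plotVarPart(vp)
```

Because it mixes continuous and categorical covariates in one model, this covers the case that pvca does not.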

ADD COMMENT • link 3.9 years ago by Papyrus ★ 3.1k

score 1 · Answer 2 · 2021-07-08


3.9 years ago

Martombo ★ 3.2k

I especially like the SVA package for this kind of analysis. It estimates hidden covariates (surrogate variables) from the gene expression matrix while preserving the variation associated with the comparison you are focused on. For RNA-seq data, use the svaseq function, which returns the estimated surrogate variables. You can then choose a subset or use them all, either by adding them to your linear model or by removing their associated variation.
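A rough sketch of the svaseq workflow, assuming the sva package is installed. The counts are simulated with one hidden, unrecorded factor, and `n.sv = 1` is fixed here only to keep the example predictable; leaving `n.sv` unset lets svaseq estimate the number of surrogate variables itself. All object names are hypothetical.

```r
library(sva)

set.seed(3)
n <- 20
pheno  <- data.frame(condition = factor(rep(c("ctrl", "treat"), each = n / 2)))
hidden <- rep(c(0, 1), times = n / 2)  # unrecorded batch-like factor

# Simulated counts: 1000 genes, the first 300 shifted by the hidden factor
mu <- matrix(100, nrow = 1000, ncol = n)
mu[1:300, hidden == 1] <- 200
counts <- matrix(rpois(length(mu), mu), nrow = 1000)

# Full model contains the variable of interest (protected from removal);
# the null model contains only the intercept
mod  <- model.matrix(~ condition, data = pheno)
mod0 <- model.matrix(~ 1, data = pheno)

# Estimate surrogate variables (n.sv fixed to 1 as an assumption)
svobj <- svaseq(counts, mod, mod0, n.sv = 1)
sv <- as.matrix(svobj$sv)  # ensure a samples x n.sv matrix
colnames(sv) <- paste0("SV", seq_len(ncol(sv)))

# Design matrix with the surrogate variables appended, ready to pass
# to a downstream linear model
mod_sv <- cbind(mod, sv)

# In a simulation the SV can be compared with the hidden factor it should capture
abs(cor(sv[, 1], hidden))
```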

ADD COMMENT • link 3.9 years ago by Martombo ★ 3.2k


Yes, I agree that the SVA approach is a great tool for identifying (and correcting for) latent sources of variation. Moreover, the identified SVs could probably also be fed into variancePartition to explore how much variation they explain compared to the known covariates, and that would surely be of interest!

Nonetheless, this post was aimed at the more general, "unsupervised" exploration of the data. I've used SVA with great results, but one could argue that "protecting" the comparison/phenotype of interest is sometimes a bit "supervised", in the sense that you're intentionally avoiding variables with some correlation to the phenotype of interest. Sometimes one may want to include known, measured covariates if their effect is clear, even at the cost of losing some of the biological signal because they are somewhat associated with the phenotype of interest.

ADD REPLY • link 3.9 years ago by Papyrus ★ 3.1k