Feature Selection in Large Dataset with High Multicollinearity
2.5 years ago

Hi all,

This is my first post here! If there is any way I can improve the quality of my posts, please let me know!

I am currently working on a research project that involves investigating whether certain patterns of aberrant alternative RNA splicing in leukaemia can be used as prognostic biomarkers.

Following analysis of the RNA-Seq data using MAGIQ by a fellow researcher, I have been provided with a data matrix of 128 patients (rows) by 31,867 RNA splicing events (columns). Each cell contains the 'percent spliced in' (PSI) value for a particular splicing isoform in a particular patient.

My original approach was to use a random forest classifier (with a dichotomous/boolean label indicating whether a given patient survived beyond 3 years) to determine feature importances, take the most important of these features, and then perform an L1-regularised Cox proportional hazards regression on them to determine which of these top features contributed significantly to survival. These 'important splicing events' would then be validated on a previously unseen validation cohort/dataset.
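
For concreteness, here is a rough sketch of what I had in mind (a minimal, untested outline using scikit-learn and lifelines; all file and column names are placeholders, and the dichotomisation below naively ignores censoring):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from lifelines import CoxPHFitter

# Placeholder file/column names for illustration only
psi = pd.read_csv('psi_matrix.csv', index_col=0)       # 128 patients x 31,867 PSI values
surv = pd.read_csv('survival.csv', index_col=0)        # columns: 'time' (months), 'event' (1 = death)

# Step 1: random forest on the dichotomised outcome (survived beyond 3 years?)
label = (surv['time'] > 36).astype(int)                # naive placeholder; ignores censoring
rf = RandomForestClassifier(n_estimators=1000, random_state=0)
rf.fit(psi, label)
importances = pd.Series(rf.feature_importances_, index=psi.columns).sort_values(ascending=False)

# Step 2: L1-regularised Cox PH on the top 0.1% of features (an arbitrary cut-off)
top = importances.head(int(0.001 * len(importances))).index
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)         # l1_ratio=1.0 gives a pure L1 penalty
cph.fit(psi[top].join(surv[['time', 'event']]), duration_col='time', event_col='event')
print(cph.summary[['coef', 'p']])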

Whilst working on this problem, I noted that multicollinearity affects the interpretability of all regression techniques, and can also distort the feature importances assigned by a random forest (since the impurity explained by one feature earlier in a given tree dramatically reduces the additional impurity explained by strongly correlated features further down the tree). Given some confusion around the meaning of 'interpretability', in this case I am referring to the ability of a researcher to use the coefficients assigned to a particular feature by the Cox regression as a measure of feature importance/contribution. To investigate whether I would be affected by these issues, I calculated the VIF (Variance Inflation Factor) on a subset of features within my dataset, and found that multicollinearity was a huge issue (many VIF values were above 10 and a very large number were above 1,000).
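
For reference, this is roughly how I computed those VIFs (a sketch using statsmodels on an arbitrary subset of the matrix; the file name is a placeholder):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

psi = pd.read_csv('psi_matrix.csv', index_col=0)       # placeholder: patients x splicing events
subset = psi.iloc[:, :200]                             # arbitrary subset; the full matrix is far too slow

X = sm.add_constant(subset)                            # VIFs should be computed with an intercept term
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],  # skip the constant at index 0
    index=subset.columns,
)
print(vifs.sort_values(ascending=False).head(20))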

Given that the aim of this research project necessitates model interpretation in order to determine the RNA splicing events/isoforms that have a significant relationship with survival, my understanding is that the multicollinearity within my dataset must be removed. Calculating the VIF for each feature with respect to every other feature and then eliminating features incrementally is not only resource intensive (a single VIF pass over all 31,867 features will take approx. 5 days on my current hardware, so repeatedly removing the feature with the highest VIF and recalculating every VIF is simply not feasible), but I am also concerned that simply removing high-VIF features would 'throw out' potentially important information that may be relevant to downstream biological interpretation. I have also read that LASSO regularisation can deal with multicollinearity by driving the coefficients of strongly correlated features to zero, but once again I am concerned that this behaviour might not be reproducible (in the sense that different features would be eliminated each time the regression is run), as well as potentially removing important features relevant to biological interpretation.

Guides I have found online recommend either combining correlated features together, or dropping features entirely. Given that the VIF does not indicate with which other features a particular feature is correlated, it is difficult to automate 'merging' of correlated features across all 31,867 RNA splicing events. I have also read that dimensionality reduction techniques such as PCA (Principal Component Analysis) will produce uncorrelated output features that can be used for downstream analysis, but I am unsure of how to determine how many principal components to form, or which of the features making up each principal component are responsible for causing certain principal components to exhibit a significant relationship with survival in the Cox model.

Finally, I am not sure what the accepted 'gold standard' is when selecting 'top' features from a random forest classifier. I am currently taking the top 0.1% of features, but this was a largely arbitrary choice. Is there a particular method to determine this threshold, or a convincing argument for a commonly used threshold?

Thank you all so much for your time!

multicollinearity • selection • feature • cox • interpretability • model • regression
2.5 years ago
Mensur Dlakic ★ 28k

I don't think it is possible to give you proper advice without writing a post longer than yours. I will therefore give you some guidelines, and the rest will be up to you to research in depth.

What you have is an underdefined system (sometimes also called underdetermined). In simple terms, your data matrix is too wide (too many variables) and not long enough (too few samples). These problems tend to have many complex solutions, and probably the only way to solve them is with fewer variables and simpler models. So random forests are probably out of the question, while Lasso (L1-based regularization) will natively shrink many feature coefficients to zero and remove them. Specifically, I recommend Lasso with cross-validation, with 10-20 folds given your number of samples:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
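
A minimal sketch of that step (file and column names are placeholders; with a binary 3-year label you could just as well use LogisticRegressionCV with an L1 penalty):

import pandas as pd
from sklearn.linear_model import LassoCV

# Placeholders: X is the (already reduced) feature matrix, y the outcome of interest
X = pd.read_csv('psi_reduced.csv', index_col=0)
y = pd.read_csv('labels.csv', index_col=0)['survived_3yr']

lasso = LassoCV(cv=15, random_state=0, max_iter=10000)   # 10-20 folds given ~128 samples
lasso.fit(X, y)

selected = X.columns[lasso.coef_ != 0]
print('alpha chosen by CV: %.4g' % lasso.alpha_)
print('features with non-zero coefficients: %d' % len(selected))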

But before you get to Lasso, you correctly noted that one must remove collinearity. VIF is a good way to do it, and one of its implementations is described here. Simply remove the feature with the largest VIF after each round of calculation. As you figured out, it would take a very long time to calculate VIFs repeatedly for >30K features, so I suggest you first remove a bunch of them, either because they are invariant or by using a simple correlation calculation. The code is below; it will take a long time and use a lot of memory, but hopefully it will be doable.

import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Assumes the CSV contains only numeric feature columns
df = pd.read_csv('your_file_name_here.csv')

# Remove invariant features (zero variance across all samples)
var_thr = VarianceThreshold(threshold=0)
var_thr.fit(df)
to_drop = df.columns[~var_thr.get_support()].tolist()
df.drop(to_drop, axis=1, inplace=True)

# Create the absolute correlation matrix
corr_matrix = df.corr().abs()
# Select the upper triangle of the correlation matrix (k=1 excludes the diagonal)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find features with correlation greater than 0.95 to any earlier feature
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
print('\nCorrelated features removed: %d' % len(to_drop))
# Drop those features
df.drop(to_drop, axis=1, inplace=True)
print('Features remaining:', len(df.columns))

It may be prudent to lower the cut-off and remove features with correlation > 0.9 if the above still leaves you with a large number. As to your worry that you may be throwing away signal by removing some features: it is possible, but that is definitely not your biggest problem. For the moment I would focus on getting the matrix down to a reasonable number of variables.

I have already spent more time on this than I intended, so here is a summary:

  • look for columns that are invariant across all rows and remove them (code provided)
  • do a simple correlation calculation and remove columns with CC > 0.95 (code provided)
  • calculate VIFs, remove the column with the largest value, and iterate until all columns have VIF < 10 (a sketch follows this list)
  • let LassoCV remove some features by shrinking their coefficients to zero
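
Here is a rough sketch of that VIF loop (using statsmodels; it assumes df is the reduced data frame from the code above, and the cut-off of 10 is a common rule of thumb rather than a hard rule):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(df, threshold=10.0):
    """Iteratively drop the feature with the largest VIF until all VIFs fall below threshold."""
    cols = list(df.columns)
    while len(cols) > 1:
        X = sm.add_constant(df[cols]).values
        vifs = pd.Series(
            [variance_inflation_factor(X, i) for i in range(1, X.shape[1])],  # index 0 is the constant
            index=cols,
        )
        if vifs.max() < threshold:
            break
        cols.remove(vifs.idxmax())
    return df[cols]

df = prune_by_vif(df)
print('Features remaining after VIF pruning:', len(df.columns))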

Thank you very much for your insight! Removing invariant features and applying a simple correlation filter prior to the VIF step should give a substantial boost in speed. Thank you also for providing code demonstrating your envisioned pipeline.

I just wanted to clarify exactly what you meant when you stated:

"So random forests are probably out of question"

Is this because random forests will not adequately perform feature selection in this underdefined/underdetermined system?


Underdefined systems have a huge number of complex solutions that do not generalize well. It is often better to find a simple solution that may be a couple of percent sub-optimal than to weed out all the complex solutions that seemingly work better but do not have general applicability.

Typically, more complex models require more data. Random forests are definitely in that category, and it is very easy to overfit an RF when data are scarce. One way around this is to keep the trees very shallow (maximum depth 2-4), at which point RFs tend not to be much better than generalized linear models. That is why I would suggest leaving out RFs, gradient boosters, neural networks and all other fancy models: they shine with different types of datasets. Even GLMs can easily overfit with the data you have, which is why I suggested cross-validation with a large number of folds.
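
As a rough illustration of that point, here is a self-contained toy comparison on synthetic data of roughly your shape (not your actual dataset):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Wide, short synthetic dataset: 128 samples, 2000 features, few of them informative
X, y = make_classification(n_samples=128, n_features=2000, n_informative=20, random_state=0)

shallow_rf = RandomForestClassifier(n_estimators=500, max_depth=3, random_state=0)
l1_logit = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, max_iter=5000)

print('shallow RF  :', cross_val_score(shallow_rf, X, y, cv=15).mean())
print('L1 logistic :', cross_val_score(l1_logit, X, y, cv=15).mean())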


This is a very clear and well-written answer. Thank you!

2.5 years ago
LauferVA 4.5k

There's no single right answer to this, but my reply felt too long for a comment. Anyway, here are a few thoughts, hopefully supplemented by the answers of others. You asked:

"Guides I have found online recommend either combining correlated features together, or dropping features entirely. Given that the VIF does not indicate with which other features a particular feature is correlated, it is difficult to automate 'merging' of correlated features across all 31,867 RNA splicing events. I have also read that dimensionality reduction techniques such as PCA (Principal Component Analysis) will produce uncorrelated output features that can be used for downstream analysis, but I am unsure of how to determine how many principal components to form, or which of the features making up each principal component are responsible for causing certain principal components to exhibit a significant relationship with survival in the Cox model."

"Finally, I am not sure what the accepted 'gold standard' is when selecting 'top' features from a random forest classifier. I am currently taking the top 0.1% of features, but this was a largely arbitrary choice. Is there a particular method to determine this threshold, or a convincing argument for a commonly used threshold?"

Agnostic, eigenvector-based decomposition algorithms

You probably know a lot of this, but I am writing for a more general audience, so please bear with me. Principal Component Analysis (PCA) is a form of eigenvector-based decomposition. It is a descriptive statistical technique; it does not ask which variables might be collinear with which. Rather, it simply takes the data, finds the longest axis of variation, and records scores for the degree to which each subject "belongs to" that principal component. The algorithm then finds the next-longest axis of variation, with the stipulation that it must be {orthogonal to, independent of, not correlated with, not related to} the first one, and continues in this way, successively producing the next-largest axis of variation, and the next.

There is really no number of PCs that is fundamentally meaningful a priori. For other eigenvector-based decomposition algorithms (there are many), you might have to specify the number of eigenvectors ahead of time, so the question is more meaningful in those cases. But with PCA you can keep going until you have decomposed ALL of the variation in the data into orthogonal axes. If that is what you want to do, the algorithm can absolutely do it for you.

But I don't recommend that, although ironically PCA could be used for this project in the following way:

Quality Control

- Run a PCA on your data.
- Correlate the top 10, 20, 50, 100, etc. PCs (the exact number doesn't really matter) with all of your covariates (be sure to choose the right kind of correlation coefficient). The goal of this is to identify whether any of the PCs correlate with any labels: sex, age, batch (i.e., you can scan for batch effects), treatment status, etc.
- The purpose of this is to identify whether any UNWANTED variables are creating collinearity in your data that does not correspond to what you want to measure (i.e., over and above what should be present biologically). Consider reading Price et al. (2006) to get a sense of how this was done for GWAS studies. A code sketch of this QC step follows this list.
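
A bare-bones sketch of that QC step (scikit-learn PCA plus Spearman correlations against a hypothetical covariate table; every file and column name here is a placeholder, and Spearman is just one reasonable choice of coefficient):

import pandas as pd
from scipy.stats import spearmanr
from sklearn.decomposition import PCA

psi = pd.read_csv('psi_matrix.csv', index_col=0)        # patients x splicing events (placeholder)
covars = pd.read_csv('covariates.csv', index_col=0)     # e.g. age, sex, batch, all numerically encoded

n_pcs = 20                                              # 10, 20, 50... the exact number is not critical
pca = PCA(n_components=n_pcs)
pcs = pd.DataFrame(pca.fit_transform(psi), index=psi.index,
                   columns=['PC%d' % (i + 1) for i in range(n_pcs)])
print('variance explained by first %d PCs: %.1f%%' % (n_pcs, 100 * pca.explained_variance_ratio_.sum()))

# Flag PCs that track covariates (batch, age, sex, ...) rather than the biology of interest
for cov in covars.columns:
    for pc in pcs.columns:
        rho, p = spearmanr(pcs[pc], covars.loc[psi.index, cov])
        if abs(rho) > 0.4:                              # arbitrary flagging threshold
            print('%s vs %s: rho=%.2f (p=%.2g)' % (pc, cov, rho, p))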

Once you are satisfied about QC, batch effect, PCA, and your data, you can move on to biological analysis of a dataset that is, hopefully, more free of unwanted covariance structures.

Biological Investigation: The genome is an orchestra, not a guitar solo:

"Given that the aim of this research project necessitates model interpretation in order to determine the RNA splicing events/isoforms that have a significant relationship with survival, my understanding is that the multicollinearity within my dataset must be removed."

In my opinion, this thought is incredibly damaging to an understanding of biology. Almost no single datum is important on its own. Take BRCA1 positivity. I have the most famous breast cancer gene, so I will get breast cancer. Right? Wrong. Penetrance varies from 8-80% depending on the rest of your genetic background - in other words depending upon covariance.

Rather, it is important that the above QC process is done well, because biological data themselves are, and ought to be, heavily collinear. Why? Because genes don't function on their own; they organize into higher and higher order structures to accomplish much larger-scale objectives. As a simple example, think of how nonsensical it would be for a bacterium to upregulate one gene that makes up a flagellar subunit without any of the others. What would it do with just one piece? It couldn't make a flagellum, and the protein is only useful insofar as it is part of a flagellum, so upregulating it alone would be pointless.

So the trick here is to be able to relate many changes made in a coordinated fashion to some other known biological entity. For instance, let's say you have a bunch of splice variants, but you can map them all back to interferon gamma signaling. You might also find that TLR3 signaling and Stat1 signaling relate as well; for that dataset all those splicing signatures would be collinear, and that would not be in any way problematic. Rather, it is evidence that the data are meaningful.

In other words, I am suggesting that you derive meaning from your data not by resolving it into uncorrelated axes, but by trying to understand the relationships that drive large-scale differences in splicing.

I don't know much about analyzing splicing data (never been asked to do it) but if it were me, I would probably get a list of all gene products that actually carry out splicing, then get lists of every gene those splicing proteins are thought to act on. I would then see what signatures match my data. If you find a strong match, this could also give you a starting point inasmuch as it would tell you what gene complexes are doing the splicing. That plus cell type plus literature review might be enough to begin to speculate on what is happening.

Spectral decomposition and a recent manuscript

Consider reading this preprint. The authors use spectral decomposition on the proteomes of several thousand bacteria. They identify layers upon layers of nested, hierarchical covariance structures by analyzing spectral components of different magnitudes.

Notice that the top few SCs (roughly 1-40) recapitulate kingdom, phylum, class, order, family, genus and species, and then the structure becomes subcellular. By the time you get to SCs in the 1000s, you are looking at minute covariance structures, perhaps a few dozen genes mediating some aspect of OxPhos. This manuscript may give you some ideas about how to use eigenvector-based decomposition for your own data.

Hope that helps in some measure.


Thank you very much for your insight!
