Hi all,
This is my first post here! If there is any way I can improve the quality of my posts, please let me know!
I am currently working on a research project that involves investigating whether certain patterns of aberrant alternative RNA splicing in leukaemia can be used as prognostic biomarkers.
A fellow researcher analysed our RNA-Seq data with MAGIQ and provided me with a data matrix of 128 patients (rows) by 31,867 RNA splicing events (columns). The value in each cell is the 'percent spliced in' (PSI) value for a particular splicing isoform in a particular patient.
My original approach was to use a random forest classifier (with a binary label indicating whether a given patient survived beyond 3 years) to determine feature importances, then take the most important of these features and fit an L1-regularised Cox proportional hazards regression on them to determine which of these top features contributed significantly to survival. These 'important splicing events' would then be validated on a previously unseen validation cohort/dataset.
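For concreteness, here is a minimal sketch of that pipeline. X, y and surv_df are placeholders for my PSI matrix, the 3-year survival labels and the survival times/event indicators, and the penaliser strength is arbitrary:

```python
# Minimal sketch of the intended pipeline; X (128 x 31,867 PSI DataFrame),
# y (1 if the patient survived beyond 3 years) and surv_df ('time', 'event')
# are placeholders, and the penaliser value is arbitrary.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from lifelines import CoxPHFitter

# 1) Random forest on the binary 3-year survival label
rf = RandomForestClassifier(n_estimators=1000, random_state=0)
rf.fit(X, y)

# 2) Keep the top 0.1% of features by impurity-based importance (arbitrary cut-off)
n_top = max(1, int(0.001 * X.shape[1]))
top_features = X.columns[np.argsort(rf.feature_importances_)[::-1][:n_top]]

# 3) L1-regularised Cox PH on the selected features (lifelines elastic net, l1_ratio=1)
cox_df = pd.concat([X[top_features], surv_df[["time", "event"]]], axis=1)
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cph.fit(cox_df, duration_col="time", event_col="event")
print(cph.summary[["coef", "p"]])
```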
Whilst working on this problem, I noted that multicollinearity affects the interpretability of all regression techniques and can distort the feature importances assigned by a random forest (the impurity reduction attributed to one feature early in a tree dramatically reduces the additional impurity that strongly correlated features can explain further down the tree). Given some confusion around the meaning of 'interpretability', in this case I mean the ability of a researcher to use the coefficient assigned to a particular feature by the Cox regression as a measure of that feature's importance/contribution. To check whether these issues would affect me, I calculated the VIF (Variance Inflation Factor) on a subset of features within my dataset and found that multicollinearity was a huge issue (many VIF values were above 10 and a very large number were above 1,000).
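For reference, this is roughly how I calculated the VIFs on that subset (statsmodels; subset_cols is a placeholder for the columns I sampled):

```python
# Roughly how I computed VIFs on a subset of features; 'subset_cols' is a placeholder.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

X_sub = add_constant(X[subset_cols])  # prepend an intercept column
vifs = pd.Series(
    [variance_inflation_factor(X_sub.values, i) for i in range(1, X_sub.shape[1])],
    index=subset_cols,
)
print((vifs > 10).sum(), "features with VIF > 10;", (vifs > 1000).sum(), "with VIF > 1,000")
```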
Given that the aim of this research project requires model interpretation in order to determine the RNA splicing events/isoforms that have a significant relationship with survival, my understanding is that the multicollinearity within my dataset must be dealt with. Calculating the VIF for each feature against every other feature and then eliminating features incrementally is not only resource intensive (a single VIF pass over all 31,867 features would take approx. 5 days on my current hardware, so repeatedly removing the highest-VIF feature and recalculating all the VIFs is simply not feasible), but I am also concerned that simply removing high-VIF features would 'throw out' potentially important information relevant to downstream biological interpretation. I have also read that LASSO regularisation can deal with multicollinearity by driving the coefficients of all but one of a group of strongly correlated features to zero, but once again I am concerned that this behaviour might not be reproducible (different features could be eliminated each time the regression is run) and might also remove features that matter for biological interpretation.
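To make the reproducibility concern concrete, this is the kind of check I had in mind (a sketch only; cox_df is as in the pipeline sketch above and the penaliser value is a placeholder):

```python
# Sketch of the stability worry: refit the L1 Cox model on bootstrap resamples and
# compare which features keep non-zero coefficients. 'cox_df' is as in the sketch above.
import numpy as np
from lifelines import CoxPHFitter

selected_sets = []
for seed in range(5):
    boot = cox_df.sample(frac=1.0, replace=True, random_state=seed)
    cph = CoxPHFitter(penalizer=0.5, l1_ratio=1.0)  # placeholder penaliser strength
    cph.fit(boot, duration_col="time", event_col="event")
    nonzero = cph.summary.index[np.abs(cph.summary["coef"]) > 1e-6]
    selected_sets.append(set(nonzero))

# Substantially different sets across resamples would mean the selection is not stable
print([len(s) for s in selected_sets])
print(len(set.intersection(*selected_sets)), "events selected in every resample")
```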
Guides I have found online recommend either combining correlated features or dropping them entirely. Since the VIF does not indicate which other features a particular feature is correlated with, it is difficult to automate the 'merging' of correlated features across all 31,867 RNA splicing events. I have also read that dimensionality reduction techniques such as PCA (Principal Component Analysis) produce uncorrelated output features that can be used for downstream analysis, but I am unsure how to determine how many principal components to retain, or how to work out which of the original features loading onto a principal component are responsible for that component showing a significant relationship with survival in the Cox model.
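For clarity, this is the PCA approach I am referring to (scikit-learn; the 90% explained-variance cut-off is just an example and X is my PSI matrix):

```python
# Sketch of the PCA idea; the 90% explained-variance cut-off is an arbitrary example.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.90)              # keep enough components for 90% of the variance
scores = pca.fit_transform(X_scaled)      # uncorrelated inputs for the Cox model

# Loadings: how strongly each original splicing event contributes to each component
loadings = pd.DataFrame(
    pca.components_.T,
    index=X.columns,
    columns=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(loadings["PC1"].abs().sort_values(ascending=False).head(10))
```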
Finally, I am not sure what the accepted 'gold standard' is for selecting the 'top' features from a random forest classifier. I am currently taking the top 0.1% of features, but this was a largely arbitrary choice. Is there a particular method for determining this threshold, or a convincing argument for a commonly used value?
Thank you all so much for your time!
Thank you very much for your insight! Removing invariant features and running a simple correlation screen before the VIF step should give a substantial speed boost. Thank you also for providing code demonstrating your envisioned pipeline.
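For my own notes, this is roughly how I plan to implement that pre-filtering (the thresholds are placeholders, and the full pairwise correlation step may still need chunking at ~32k features):

```python
# Rough plan for the suggested pre-filtering; the variance and correlation thresholds
# are placeholders, and the full pairwise correlation may need chunking at ~32k features.
import numpy as np

# 1) Drop (near-)invariant splicing events
X_var = X.loc[:, X.var(axis=0) > 1e-4]

# 2) Drop one of each pair of highly correlated events before any VIF calculation
corr = X_var.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_filtered = X_var.drop(columns=to_drop)

print(X.shape[1], "->", X_filtered.shape[1], "features after pre-filtering")
```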
I just wanted to clarify exactly what you meant when you stated:
Is this because random forests will not adequately perform feature selection in this underdefined/underdetermined system?
Underdefined systems have a huge number of complex solutions that do not generalize well. It is often better to find a simple solution that may be a couple of percent sub-optimal than to weed out all the complex solutions that seemingly work better but do not have general applicability.
Typically, more complex models require more data. Random forests are definitely in that category, and it is very easy to overfit an RF when data is scarce. One way around it is to keep the trees very shallow (maximum depth 2-4), at which point RFs tend not to be much better than generalized linear models. That is why I would suggest leaving out RFs, gradient boosters, neural networks and all other fancy models, because they shine with different types of datasets. Even GLMs can easily overfit with the data you have, which is why I suggested cross-validation with a large number of folds.
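To illustrate the kind of setup I mean (a sketch only; lifelines' CoxPHFitter with a penaliser and k_fold_cross_validation stand in for 'a GLM with many folds', and all values are placeholders):

```python
# Illustrative only: a penalised Cox model (a GLM-type model) scored with many-fold CV.
# 'cox_df' holds the candidate features plus 'time'/'event' columns; values are placeholders.
import numpy as np
from lifelines import CoxPHFitter
from lifelines.utils import k_fold_cross_validation

cph = CoxPHFitter(penalizer=0.5)  # placeholder penaliser strength
scores = k_fold_cross_validation(
    cph, cox_df, duration_col="time", event_col="event",
    k=20, scoring_method="concordance_index",
)
print(np.mean(scores), "+/-", np.std(scores))
```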
This is a very clear and well-written answer. Thank you!