I'm doing some machine learning on RNA-seq samples to predict sample group, and using the feature selection techniques built into scikit-learn to rank genes. Specifically, I'm using the F-test and Gini importance (built into random forest classifiers). As input I'm using the variance-stabilized transformed counts output by DESeq2. However, I'm noticing that there is very little agreement between the feature-selected genes and the differentially expressed genes (also calculated using DESeq2). I understand that the statistical methods used for differential expression differ from the F-test and Gini importance, but I was wondering if anyone could offer deeper clarification on this, or references to read.
Yes, to understand the differences you would want to compare the differential expression method you've used against the F-test and Gini index methods used by your feature selection, in the context of your experimental design. If this is a one-factor design, for example, and you used ANOVA to select differentially expressed genes, then that is an F-test. So I'd guess (not knowing the details of your feature selection) that in that case the genes selected by differential expression might be in reasonable agreement with the feature-selected genes if feature selection was done by F-test, because both would be using the same gene selection criterion. In the Gini case, I believe your random forest classifier would be using the Gini index to measure the reduction in impurity at each tree split, which, unlike an F-test, would not necessarily take into account the magnitude of the difference in gene expression between experimental groups relative to experimental noise. A fairly recent reference on the use of Gini for variable importance is here; it might be worth a read.
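To make the F-test versus Gini comparison concrete, here is a minimal sketch using scikit-learn's `f_classif` and the `feature_importances_` attribute of `RandomForestClassifier`. Everything else is an assumption for illustration: the data are synthetic (a random matrix standing in for variance-stabilized counts, with the first 10 columns shifted in one group), and `rank_genes`, `n_top`, and the forest settings are hypothetical choices, not the original poster's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif


def rank_genes(X, y, n_top=20, seed=0):
    """Rank columns of X (genes) by ANOVA F-statistic and by Gini importance."""
    # Univariate ANOVA F-test: between-group variance relative to within-group variance.
    f_stat, _ = f_classif(X, y)

    # Gini importance: mean decrease in impurity across all splits in the forest.
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    gini = forest.feature_importances_

    # Indices of the n_top highest-scoring genes under each criterion.
    top_f = np.argsort(f_stat)[::-1][:n_top]
    top_gini = np.argsort(gini)[::-1][:n_top]
    return top_f, top_gini, f_stat, gini


# Synthetic stand-in for a samples x genes matrix (assumption, not real data).
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 200
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_genes))
X[y == 1, :10] += 1.5  # the first 10 "genes" differ between groups

top_f, top_gini, f_stat, gini = rank_genes(X, y)
shared = set(top_f.tolist()) & set(top_gini.tolist())
print(f"{len(shared)} of {len(top_f)} top genes are shared between the two rankings")
```

On real data the two rankings can diverge further, since Gini importance is spread across correlated genes and depends on which features each tree happens to sample.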
Thank you for the reference. I don't think DESeq2 uses an F-test to calculate differential expression: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#theory
You are certainly right: the GLM approach in DESeq2 is not an F-test, which would make assumptions (equal variance, etc.) that DESeq2 does not make. Assuming you submitted all observed transcripts to the random forest classifier, and not only the genes DESeq2 called differentially expressed, it would be interesting to look at the pre-transformation expression patterns of the genes that were selected by the random forest feature selection but not by DESeq2. Are any of them low-expression genes or artifacts? In the big picture, if the goal is to identify the best-performing predictor of sample group, I'd speculate that gene sets selected by tools like DESeq2, which were specifically designed to identify differentially expressed transcripts in RNA-seq data using a regression/variance model matched to the characteristics of that data, might in the long run outperform more generic, technology-agnostic feature selection approaches like the F-test or Gini importance. But I don't have evidence for this, and note that you are submitting the variance-stabilized transformed counts to the random forest. I'd imagine someone must have published on this feature selection question, but I'm not aware of a good reference. Perhaps others can comment as well.
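The check suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: `flag_low_expression` is a hypothetical helper, the raw count matrix is assumed to be samples x genes, and the `min_mean_count=10` cutoff is an arbitrary placeholder rather than a recommended threshold.

```python
import numpy as np


def flag_low_expression(raw_counts, gene_ids, rf_selected, de_selected, min_mean_count=10):
    """Return {gene: mean raw count} for genes picked by the random forest only
    whose mean raw count falls below min_mean_count (an arbitrary cutoff)."""
    raw_counts = np.asarray(raw_counts, dtype=float)
    mean_counts = raw_counts.mean(axis=0)  # per-gene mean over samples
    column = {gene: i for i, gene in enumerate(gene_ids)}

    # Genes selected by the random forest but not called by DESeq2.
    rf_only = sorted(set(rf_selected) - set(de_selected))

    return {
        gene: float(mean_counts[column[gene]])
        for gene in rf_only
        if mean_counts[column[gene]] < min_mean_count
    }


# Toy example with made-up gene names and counts.
counts = np.array([[0, 100, 3],
                   [2, 120, 5]])
flagged = flag_low_expression(counts, ["geneA", "geneB", "geneC"],
                              rf_selected=["geneA", "geneB"],
                              de_selected=["geneB"])
print(flagged)
```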
I am in the same boat, with the same questions about the best feature selection methodology to use.
From my literature searching there does not seem to be a gold standard for this, so I have so far been comparing feature-selected genes against DESeq2 significant genes. There does seem to be significant overlap and, as you would hope, a massively reduced number of genes identified via the feature selection methods.
Have you also been comparing your feature-selected genes against your DESeq2 results?
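For anyone wanting to put a number on that kind of comparison, here is a minimal sketch that reports the shared genes, the Jaccard index, and a one-sided hypergeometric p-value for the overlap. The function name `overlap_summary`, the gene labels, and the choice of the hypergeometric test are assumptions for illustration; `universe_size` should be the number of genes that were actually eligible for both analyses.

```python
from math import comb


def overlap_summary(selected, de_genes, universe_size):
    """Summarize overlap between feature-selected genes and DE genes."""
    selected, de_genes = set(selected), set(de_genes)
    shared = selected & de_genes
    union = selected | de_genes
    jaccard = len(shared) / len(union) if union else 0.0

    # One-sided hypergeometric tail: probability of seeing at least this many
    # shared genes if `selected` were drawn at random from the universe.
    k, n_sel, n_de = len(shared), len(selected), len(de_genes)
    tail = sum(
        comb(n_de, i) * comb(universe_size - n_de, n_sel - i)
        for i in range(k, min(n_sel, n_de) + 1)
    )
    p_value = tail / comb(universe_size, n_sel)

    return {"shared": sorted(shared), "jaccard": jaccard, "p_value": p_value}


# Toy example with made-up gene labels.
result = overlap_summary(
    selected=["g1", "g2", "g3", "g4"],
    de_genes=["g3", "g4", "g5", "g6", "g7", "g8"],
    universe_size=10,
)
print(result)
```

Note that a small p-value only says the overlap is larger than chance; it does not say which of the two gene lists is the better one.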
How many samples do you have to start with? I would suggest sticking to the DESeq2 results if it's bulk RNA-seq and you don't have hundreds of samples. What's the percent of variation between the two groups of samples? Machine learning approaches work better for single-cell data because there are thousands of cells available to classify cell types and to identify cell-type-specific markers.
Hello geek_y. While this is in response to the original question, I thought I would add in my data set as well. I have 84 samples and 3 groups: pre-exposure, exposure with no disease signs, and exposure with disease signs. I get great sample clustering with PCA (exposure with no disease signs sits between pre-exposure and exposure with disease signs). I have found this thread, "What is the best way to combine machine learning algorithms for feature selection such as Variable importance in Random Forest with differential expression analysis?", which describes a pretty great pipeline from Kevin, so I am currently taking my DEGs and performing downstream analysis to reduce the number of features. If you want any more info let me know :) I understand that this works well for scRNA-seq, but from reading the literature it does seem to have a place in bulk RNA-seq as well. I do think the true power is the reduced feature set that can be obtained, thus identifying smaller groupings of genes to focus on. Ben
With your sample size and experimental setup, I think differential gene expression is more reliable than any regression-based analysis. Regression-based methods have existed for a long time, and there is a reason why they are not used for bulk RNA-seq differential expression, just as there is a reason why they are emerging with single-cell data. I'd guess 84 samples don't give enough power, especially if the biological differences between conditions are not very drastic.