Question

Feature selected-genes vs. differentially expressed genes

1

Entering edit mode

5.1 years ago

randalljellis ▴ 100

I'm doing some machine learning on RNA-seq samples to predict sample group, and using feature selection techniques built into scikit-learn to rank genes. Specifically, I'm using the F-test and Gini importance (built into random forest classifiers). I'm using the variance-stabilized transformed counts outputted from DESeq2. However, I'm noticing that there is very little agreement between the feature-selected genes and the differentially expressed genes (also calculated using DESeq2). I understand that the statistical methods used for differential expression differ from the F-test and Gini importance, but I was wondering if anyone could offer deeper clarification on this, or references to read.

machine learning RNA-Seq feature selection • 3.8k views

ADD COMMENT • link updated 5.0 years ago by bdy8 ▴ 80 • written 5.1 years ago by randalljellis ▴ 100

1

Entering edit mode

Yes, to understand differences you would want to compare the differential expression method you've used against the F-test and Gini index methods used by your feature selection, in the context of your experimental design. If this is a one-factor design, for example, and you used ANOVA to select differentially-expressed genes, then that's a F-test, So I'd guess (not knowing details of your feature selection) that in that case the differential-expression selected genes might be in reasonable agreement with feature-selected genes if feature selection was done by F-test. Because they would be using the same gene selection criterion. In the Gini case I believe your random forest classifier would be using Gini index to measure reduction in impurity in a tree-split, which, unlike an F-test, would not necessarily be taking into account the magnitude of the difference in gene expression between experimental groups relative to experimental noise. A fairly recent reference on use of Gini for variable importance is here, might be worth a read.

ADD REPLY • link 5.1 years ago by Ahill ★ 2.0k

1

Entering edit mode

Thank you for the reference. I don't think DESeq uses an F-test to calculate differential expression: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#theory

The DESeq2 model and all the steps taken in the software are described in detail in our publication (Love, Huber, and Anders 2014), and we include the formula and descriptions in this section as well. The differential expression analysis in DESeq2 uses a generalized linear model of the form:

Kij∼NB(μij,αi)

μij=sjqij

log2(qij)=xj.βi

where counts Kij for gene i, sample j are modeled using a negative binomial distribution with fitted mean μij and a gene-specific dispersion parameter αi. The fitted mean is composed of a sample-specific size factor sj and a parameter qij proportional to the expected true concentration of fragments for sample j. The coefficients βi give the log2 fold changes for gene i for each column of the model matrix X. Note that the model can be generalized to use sample- and gene-dependent normalization factors sij.

The dispersion parameter αi defines the relationship between the variance of the observed count and its mean value. In other words, how far do we expected the observed count will be from the mean value, which depends both on the size factor sj and the covariate-dependent part qij as defined above.

Var(Kij)=E[(Kij−μij)2]=μij+αiμ2ij

An option in DESeq2 is to provide maximum a posteriori estimates of the log2 fold changes in βi after incorporating a zero-centered Normal prior (betaPrior). While previously, these moderated, or shrunken, estimates were generated by DESeq or nbinomWaldTest functions, they are now produced by the lfcShrink function. Dispersions are estimated using expected mean values from the maximum likelihood estimate of log2 fold changes, and optimizing the Cox-Reid adjusted profile likelihood, as first implemented for RNA-seq data in edgeR (Cox and Reid 1987,edgeR_GLM). The steps performed by the DESeq function are documented in its manual page ?DESeq; briefly, they are:

estimation of size factors sj by estimateSizeFactors estimation of dispersion αi by estimateDispersions negative binomial GLM fitting for βi and Wald statistics by nbinomWaldTest

ADD REPLY • link 5.1 years ago by randalljellis ▴ 100

0

Entering edit mode

You are certainly right - the GLM approach in DeSeq2 is not a F-test, which would make assumptions of equal variance, etc. that are not made in DESeq2. Assuming you submitted all observed transcripts to the random forest classifier, and not only DESeq2-DE genes, it would be interesting to look at the pre-transformation expression patterns of the genes selected by the random forest feature selection that were not selected by DESeq2. Are any of them low expression, artifacts? In the big picture, if the goal is to identify a predictor of sample group with highest performance, I'd speculate that using gene sets selected by tools like DESeq2, that were specifically designed to identify differentially expressed transcripts in RNA-Seq data, using a regression/variance model that is matched to the characteristics of that data, might in the long run outperform more generic technology-agnostic feature-selection approaches like F-test/Gini. But I don't have evidence for it, and note you are submitting the variance-stabilized transformed counts to the random forest. I'd imagine someone must have published on this feature selection question, but I'm not aware of a good reference. Perhaps others can comment as well.

ADD REPLY • link 5.1 years ago by Ahill ★ 2.0k

0

Entering edit mode

I am also in this same boat and having the same question and queries about the best feature selection methodology to use.

From my literature searching there does not seem to be a gold standard for this so I have so far been comparing feature selected genes against DeSeq2 significant genes. There does seem to be significant overlap but, and as you would hope, a massively reduced number of genes identified via the feature selection methods.

Have you also been comparing your feature selected genes against your DeSeq2 results ?

ADD REPLY • link 5.0 years ago by bdy8 ▴ 80

0

Entering edit mode

How many samples do you have to start with ? I would suggest to stick to DESeq2 results if its bulk RNA-Seq and if you dont have hundreds of samples. Whats the percent of variation between two groups of samples ? Machine learning approaches works better for single cell data because of 1000s of cells to classify cell types and to identify cell-type specific markers.

ADD REPLY • link 5.0 years ago by GouthamAtla 12k

0

Entering edit mode

Hello geek_y. While this is in response to the original question I though I would add in my data set as well. I have 84 samples and 3 groups, pre exposure, exposure no disease signs and exposure disease signs. I get great sample clustering through PCA analysis (exposure no disease signs sits between pre exposure and exposure disease signs). I have found this What is the best way to combine machine learning algorithms for feature selection such as Variable importance in Random Forest with differential expression analysis? which is a pretty great pipeline from Kevin, so I am now currently taking my DEG and performing downstream analysis to reduce number of features. If you want any more info let me know :) I understand that it works well for SSRNA-seq but from reading the literature it does seem it has a place in bulk RNA-seq as well. I do thin the true power being the reduced feature selection that can be implemented thus identifying smaller groupings of genes to focus on. Ben

ADD REPLY • link 5.0 years ago by bdy8 ▴ 80

0

Entering edit mode

With your sample size and experimental setup, I think differential gene expression is more reliable than any regression based analysis. Regression based methods exists from a long time and there is a reason why there are not used in bulk RNA-Seq differential expression. There is a reason why they are emerging with single cell data. I guess 84 samples doesn't have enough power, especially if the biological differences between conditions is not very very drastic.

ADD REPLY • link 5.0 years ago by GouthamAtla 12k