Hi,
Here I have a feature which was measured in patients from the BRCA dataset of TCGA (each patient has one value for this feature). It is a continuous variable. I want to find a set of genes measured by gene expression of RNA-seq which is highly correlated with this feature (so-called “gene signatures”). My plan is to use stepwise linear regression to find this set of genes.
My questions are:
Is it suitable to use stepwise linear regression here?
Could you please recommend some available package in R to do this job?
Do you have any other better choices to achieve my aim?
Thanks
Hi Kevin,
It is very nice of you to supply such precious materials to me. I have some questions after reading your post.
My response value is continuous value just like Chip-seq reads counts and my predictive values are expression of genes. Should I use “poisson” instead of “multinomial” in “cv.glmnet” function? (In the help page of “glmnet”, it recommend to use poisson when response is non-negative counts).
In the “Downstream” section of the post (https://www.biostars.org/p/287511/#287555), you built a new regression using the best predictors from “cv.glmnet”. But the best predictors as well as coefficients are already given by “cv.glmnet” (store in “co” in “Identify best predictors” section). Why do not you use these?
In my own analysis, I also plot variable deviances vs. shrinkage parameter, λ (lambda) as your “Perform 10-fold cross validation” section. But my plot is strange, the plot is below:
https://pan.baidu.com/s/1_oLWxTQDzrzoLz_V1oOrHA
Is it normal? What is the key of this plot? (Just like the scatterplot of predictor vs residuals in “lm” function which should look like stars in the sky)
Yes, you should use Poisson. The counts are normalised, correct?
The model in Downstream is constructed so that one can then perform ROC analysis, make model predictions, and derive p-values. Often, glmnet does not function too good, so, it is important to not use it as the sole method of identifying variables.
I cannot load the image(?)
Did you follow the example, here: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#poi ?
Hi Kevin,
Thank you for your replying, two additional questions:
I used the best predictors from “cv.glmnet” function to build the new model using “glm” function, then I summarized the regression using “summary” function. I found that not all predictors here are significant. Can I perform second round of stepwise regression based on the best predictors from “cv.glmnet” function?
I have check the url (https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#poi). I basically follow this manual and your post. But it does not descript the detail of this figure. I update my figure of variable deviances vs. shrinkage parameter, λ (lambda) as below. Can you see it now?
Yes, I believe it helps to perform some extra step on the predictors from glmnet, like, for example, stepwise regression.
Yes, I can see it. Can you plot a histogram of your count data?
Below is the histogram of my count data (response value):
It is generated by "hist" function
It is not a 'perfect' Poisson.. it has impurities / 'bumps'. What are the units of these?; how did you normalise the data? What is the range of values? Maybe you could try to shift the data by +1.
Do the predictors seem okay? I mean, are the results what you expected?
The response value is mutation load. It is calculated as the total number of miss-sense mutations in CDS region per patient. Here I want to explore the expression profile correlated to it. What I can say is that the distribution of this value is similar with previous works. It is hard to say how to do further normalization. In some previous works they furtherly calculated ratio between this counts and the coverage region of gene which does not matter in my mind.
The predictor value is log transformed TMM normalized RNA-seq reads counts.
How did you log-transform the RNA-seq counts? The deviance and error bars are just very big.
Do you mean that I should use TMM normalized RNA-seq counts without log-transformed as predictor value? In survival analysis it is recommanded to use log transformed TMM normalized RNA-seq counts. what is the difference?
Yes, logged data is better. However, the data in your histogram does not appear to be logged (?). Also, if you used the EdgeR program (?), you should produce logCPM counts and use that for predictors.
I think there is a missleading. The histogram is response value, which is mutation load not expression of gene (the expression value is used as predictor). Do you mean I should try log-transformed mutation load as response? the histogram of log-transformed mutation load is below which seems to be normal distribution
yes, I used EdgeR and produced logCPM
Great!
I see - thank you for confirming. You could, yes, log this response variable and change the family to Gaussian. I think that the error bars will be lower in that case.
Another thought: you could di- or tri-chotomise your response variable and switch to binomial / multinomial logistic regression with glmnet. I wrote more about this, here: