4.2 years ago
randalljellis
When running the DESeqDataSetFromMatrix function, I get a warning: "the design formula contains one or more numeric variables that have mean or standard deviation larger than 5 (an arbitrary threshold to trigger this message). it is generally a good idea to center and scale numeric variables in the design to improve GLM convergence."
The variable flagged is my RIN values. Is it a good idea to center and scale these?
Should RIN values really have that large a range?
Apologies, I should have clarified. The mean RIN is >5, the SD is not (0.21).
The mean RIN is obviously not the problem. The software thinks that the mean deviation or the standard deviation is > 5. I find it a little surprising that a list of numbers ranging between 5 and 10 has such a large standard deviation.
The mean RIN value is 9.21, which is larger than 5, and hence caused the warning. The standard deviation of the RINs is 0.21. Is this a large standard deviation?
Again, I don't think that's what the warning means. I think it was telling you that the mean deviation was > 5.
All of the values are between 8.9 and 9.4.
And you don't think it's strange that a list like that has a mean deviation or standard deviation of > 5? Those values are all excellent, so why do you think they are skewing your data such that it needs correcting?
It would be strange, but that's because I don't think that's what the warning is saying. It's suggesting to center and scale the data, i.e., "center" it to have a mean of 0 and "scale" it to have a standard deviation of 1.
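To make the centering and scaling concrete, here is a minimal sketch of the arithmetic. The threshold check follows only the wording of the warning ("mean or standard deviation larger than 5"); the function names and the RIN values are hypothetical, chosen to fall in the 8.9 to 9.4 range mentioned in this thread. In R the same transformation is usually done with scale(); Python is used here just to show the numbers.

```python
from statistics import mean, stdev

THRESHOLD = 5  # the "arbitrary threshold" quoted in the warning text

def would_trigger_warning(values):
    """True if the mean or the sample SD exceeds the threshold."""
    return mean(values) > THRESHOLD or stdev(values) > THRESHOLD

def center_and_scale(values):
    """Subtract the mean and divide by the sample SD (z-scores)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical RIN values, not the poster's real data
rin = [8.9, 9.0, 9.1, 9.2, 9.3, 9.4]
rin_scaled = center_and_scale(rin)

print(would_trigger_warning(rin))         # True: the mean alone is above 5
print(would_trigger_warning(rin_scaled))  # False: mean ~0, SD ~1
```

The first check fires even though the SD is tiny, which matches the situation described above: a mean above 5 is enough on its own.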
Why do you think this DESeq object can be made with no errors? Time-course RNA-seq analysis in Deseq2
It's in the source code: https://rdrr.io/bioc/DESeq2/src/R/AllClasses.R
Do samples separate in the PCA by RIN?
I'm not sure what you mean. The warning is not referring to the counts matrix, it is referring to the design matrix, which has metadata (RIN, experimental group, etc.). The only numerical column in the matrix is RIN, and you can't do PCA on 1D data.
My colleague means a situation like, for example, here (see ESR1): https://github.com/kevinblighe/PCAtools#colour-by-a-continuous-variable-and-plot-other-pcs
That is, do your samples segregate based on RIN status? My gut feeling is to always leave RIN out of the model.
They do not segregate based on RIN. Why do you leave it out?
If there is no evidence of RIN introducing bias into the data, then including it in your design formula could ironically introduce bias where there previously existed none. I find that people sometimes 'over worry' about controlling for batch effects in gene expression data. Batch effects can be easily mitigated through solid experimental design and consistent laboratory prep work.
I was asking whether the samples show signs of batch effect in the PCA which can, at least in part, be explained by RIN. If not then there is little point to even bother with RIN as a covariate. I was always wondering whether continuous RIN makes sense as a covariate as it assumes a linear relationship between RIN and the created bias. I'd check by PCA (colored by RIN) on the log2-counts or vst/rlog whether you can see that RIN has any effect on the samples.
All samples were processed in the same batch.
You are not getting my point. RIN is the potential batch effect since it is apparently largely different between the samples. On this PCA above, is that done on the log2-scale expression values?
The RIN is not largely different between the samples. The mean of the RINs is >5, which triggered the warning, but the SD is small (0.21). The PCA is of raw counts.
PCA must not be done on raw counts, since it will be confounded by sequencing depth, and the spread is large because the data are not on the log2 scale. I suggest you do PCA on log2 normalized counts or, even better, on the output of vst.
FWIW, I would only use a RIN score if a low RIN score sample also looked like an outlier in the PCA. Then I could justify removing it, on the grounds that the RIN score shows that a messed up RNA prep is skewing its counts.
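To illustrate the earlier point about raw counts being confounded by sequencing depth, here is a small sketch with two hypothetical samples that have identical composition but a tenfold difference in depth. The normalization is a plain library-size scaling to counts per million followed by log2(x + 1). It is a simplified stand-in for illustration, not DESeq2's size-factor normalization and not vst; the function name and counts are made up.

```python
from math import log2

def log2_cpm(counts):
    """Scale one sample's counts to counts per million, then take log2(x + 1)."""
    total = sum(counts)
    return [log2(c / total * 1_000_000 + 1) for c in counts]

# Hypothetical samples: same composition, sample_b sequenced 10x deeper
sample_a = [10, 100, 1000, 8890]
sample_b = [c * 10 for c in sample_a]

# Largest per-gene difference between the two samples, before and after
raw_gap = max(abs(a - b) for a, b in zip(sample_a, sample_b))
norm_gap = max(abs(a - b) for a, b in zip(log2_cpm(sample_a), log2_cpm(sample_b)))

print(raw_gap)   # large, and driven purely by depth
print(norm_gap)  # effectively zero once depth is accounted for
```

On raw counts these two samples would sit far apart in a PCA even though nothing biological differs between them, which is why the transformed values are the ones to plot and colour by RIN.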