Center and scale RIN values for DESeq2?
4.2 years ago

When running the DESeqDataSetFromMatrix function, I get a warning: "the design formula contains one or more numeric variables that have mean or standard deviation larger than 5 (an arbitrary threshold to trigger this message). it is generally a good idea to center and scale numeric variables in the design to improve GLM convergence."

The flagged variable is RIN. Is it a good idea to center and scale it?

Should RIN values really have that large a range?

Apologies, I should have clarified: the mean RIN is greater than 5, but the SD (0.21) is not.

The mean RIN is obviously not the problem. The software thinks that the mean deviation or the standard deviation is > 5. I find it a little surprising that a list of numbers ranging between 5 and 10 has such a large standard deviation.

The mean RIN value is 9.21, which is larger than 5, and hence caused the warning. The standard deviation of the RINs is 0.21. Is this a large standard deviation?

Again, I don't think that's what the warning means. I think it was telling you that the mean deviation was > 5.

All of the values are between 8.9 and 9.4.

And you don't think it's strange that a list like that has a mean deviation or standard deviation of > 5? Those values are all excellent, so why do you think they are skewing your data such that it needs correcting?

It would be strange, but that's because I don't think that's what the warning is saying. It's suggesting to center and scale the data, i.e., "center" it to have a mean of 0 and "scale" it to have a standard deviation of 1.
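To make the "center and scale" step concrete, here is a minimal base R sketch. The RIN vector is made up to fall in the range mentioned in this thread, and the `dds` object and `condition` variable in the trailing comments are assumptions for illustration, not anything taken from the original data.

```r
# Hypothetical RIN values in the range reported in this thread (8.9 to 9.4)
rin <- c(8.9, 9.0, 9.1, 9.2, 9.3, 9.4, 9.3, 9.2)

# scale() subtracts the mean and divides by the SD; it returns a one-column
# matrix, so as.numeric() turns the result back into a plain vector
rin_scaled <- as.numeric(scale(rin))

mean(rin)         # well above 5
sd(rin)           # small
mean(rin_scaled)  # approximately 0
sd(rin_scaled)    # 1

# With a DESeqDataSet called `dds` (assumed, not shown here), the scaled
# column could be stored and used in the design along these lines:
# dds$RIN_scaled <- as.numeric(scale(dds$RIN))
# design(dds) <- ~ RIN_scaled + condition
```

Centering and scaling only changes the units of the RIN coefficient (per SD of RIN instead of per RIN unit); it does not change which samples have higher or lower RIN.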

Why do you think this DESeq object can be made with no errors? See the thread "Time-course RNA-seq analysis in Deseq2".

It's in the source code: https://rdrr.io/bioc/DESeq2/src/R/AllClasses.R

designVarsNumeric <- sapply(designVars, function(v) is.numeric(colData(se)[[v]]))
if (any(designVarsNumeric)) {
  msgIntVars <- FALSE
  msgCenterScale <- FALSE
  for (v in designVars[designVarsNumeric]) {
    if (all(colData(se)[[v]] == round(colData(se)[[v]]))) {
      msgIntVars <- TRUE
    }
    if (mean(colData(se)[[v]]) > 5 | sd(colData(se)[[v]]) > 5) {
      msgCenterScale <- TRUE
    }
  }
  if (msgIntVars) {
    message("  the design formula contains one or more numeric variables with integer values, specifying a model with increasing fold change for higher values.
  did you mean for this to be a factor? if so, first convert
  this variable to a factor using the factor() function")
  }
  if (msgCenterScale) {
    message("  the design formula contains one or more numeric variables that have mean or standard deviation larger than 5 (an arbitrary threshold to trigger this message). it is generally a good idea to center and scale numeric variables in the design
  to improve GLM convergence.")
  }
}
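
The condition in the quoted source is an OR of two tests, so a mean above 5 is enough on its own. A small stand-alone restatement of that test, applied to a made-up RIN vector (the helper name `needs_center_scale_msg` is hypothetical and not part of DESeq2):

```r
# Stand-alone restatement of the threshold test quoted above
needs_center_scale_msg <- function(x) mean(x) > 5 | sd(x) > 5

# Hypothetical RIN values between 8.9 and 9.4
rin <- c(8.9, 9.0, 9.1, 9.2, 9.3, 9.4, 9.3, 9.2)

needs_center_scale_msg(rin)                     # TRUE: the mean alone exceeds 5
needs_center_scale_msg(as.numeric(scale(rin)))  # FALSE: mean ~0, SD 1
```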

Do samples separate in the PCA by RIN?

I'm not sure what you mean. The warning is not referring to the counts matrix, it is referring to the design matrix, which has metadata (RIN, experimental group, etc.). The only numerical column in the matrix is RIN, and you can't do PCA on 1D data.

My colleague means a situation like, for example, here (see ESR1): https://github.com/kevinblighe/PCAtools#colour-by-a-continuous-variable-and-plot-other-pcs

[Figure: PCA plot colored by a continuous variable (ESR1), from the PCAtools README linked above]

That is, do your samples segregate based on RIN status? My gut feeling is to always leave RIN out of the model.

They do not segregate based on RIN. Why do you leave it out?

If there is no evidence of RIN introducing bias into the data, then including it in your design formula could ironically introduce bias where there previously existed none. I find that people sometimes 'over worry' about controlling for batch effects in gene expression data. Batch effects can be easily mitigated through solid experimental design and consistent laboratory prep work.

I was asking whether the samples show signs of a batch effect in the PCA that can, at least in part, be explained by RIN. If not, then there is little point in bothering with RIN as a covariate. I have always wondered whether continuous RIN makes sense as a covariate, as it assumes a linear relationship between RIN and the bias it creates. I'd check with a PCA (colored by RIN) on the log2 counts or the vst/rlog output whether RIN has any visible effect on the samples.

All samples were processed in the same batch.

You are not getting my point. RIN is the potential batch effect, since it apparently differs substantially between the samples. Was the PCA above done on log2-scale expression values?

The RIN does not differ much between the samples. The mean of the RINs is >5, which triggered the warning, but the SD is small (0.21). The PCA is of raw counts.

PCA should not be done on raw counts: the result is confounded by sequencing depth, and the spread is large because the data are not on the log2 scale. I suggest doing the PCA on log2 normalized counts or, even better, on the output of vst.
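
As a rough illustration of that check, here is a base R sketch on simulated counts. Everything in it is made up for the example: the counts, the depths and the RIN values are simulated, and the library-size normalization is only a crude stand-in for DESeq2's size factors. With a real DESeqDataSet, something like `plotPCA(vst(dds), intgroup = "RIN")` would be the more usual route.

```r
set.seed(1)

# Simulated data: 500 genes x 8 samples of negative-binomial counts
n_genes <- 500
depth <- c(0.5, 0.8, 1, 1.2, 1.5, 2, 0.7, 1.3)  # uneven sequencing depth
base_mean <- rexp(n_genes, rate = 1 / 200)
counts <- sapply(depth, function(d) rnbinom(n_genes, mu = base_mean * d, size = 10))
colnames(counts) <- paste0("sample", seq_along(depth))

# Hypothetical RIN values, one per sample
rin <- c(8.9, 9.0, 9.1, 9.2, 9.3, 9.4, 9.3, 9.2)

# Crude library-size normalization (stand-in for DESeq2 size factors),
# then log2 with a pseudocount of 1
size_factors <- colSums(counts) / mean(colSums(counts))
log_norm <- log2(sweep(counts, 2, size_factors, "/") + 1)

# PCA with samples as rows
pca <- prcomp(t(log_norm))

# Correlation of the leading PCs with RIN
cor(pca$x[, 1:3], rin)

# PC1 vs PC2, colored from blue (low RIN) to red (high RIN)
pal <- colorRampPalette(c("blue", "red"))(100)
col_idx <- cut(rin, breaks = 100, labels = FALSE)
plot(pca$x[, 1], pca$x[, 2], col = pal[col_idx], pch = 19,
     xlab = "PC1", ylab = "PC2")
```

If no leading PC tracks RIN and the colors look randomly mixed in the plot, that would be consistent with leaving RIN out of the design.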

FWIW, I would only use a RIN score if a low RIN score sample also looked like an outlier in the PCA. Then I could justify removing it, on the grounds that the RIN score shows that a messed up RNA prep is skewing its counts.
