In my bulk RNA-seq dataset, I noticed a few outliers on MDS plot and on further investigation found that these samples originated from one extraction batch (extbatch) 7.
Fig 1: MDS plot showing outliers and that they belong to extbatch 7.
Most of the samples from batch 7 and even batch 8 had lower RNA yield than other samples.
Fig 2: Mean and SD of RNA yield over extraction batches.
So now i am trying to correct this. I tried sva::ComBat()
to correct for this extbatch and I ended up with negative values in my data. For comparison, I tried correction using scran::mnnCorrect()
. This also produced negative values.
scran::mnnCorrect(clist[[1]],clist[[2]],clist[[3]],clist[[4]],k=4)
sva::ComBat(dat=data,batch=extbatch,mod=model.matrix(~family+condition,meta),
par.prior=TRUE,prior.plot=TRUE)
Fig 3: Distribution of raw counts, counts after scran correction and combat correction. X-axis are individual samples. Y-axis are counts forced to limits -100K and 100K. Colours denote extraction batches. Red is the offending batch 7 with mostly bad samples. See appearance of negative values after batch correction.
Now, there is the argument that rather than correcting the batch, include it as a factor in my GLM model (~extbatch+family+condition
). I did that and found 2 DEGs. If I remove samples from this batch 7 from my data altogether and run my model (~family+condition
), I find many DEGs.
So, some of my questions are
- Why do I get negative values and how do I fix that?
- Is it better to discard these offending samples (I lose half my data) rather than fixing it using batch correction.
- Why is the GLM with extbatch failing to produce any DEGs while removing the batch works?
- If anyone has any better suggestions to handle this?
Just to add a comment to my own answer: If you choose to go along with DESeq2, then its regularised log transformation (post normalisation) deals very well with low counts.
You may additionally want to take a look at this thread: Surrogate Variable Analysis for Complex Experimental Design
I am using DESeq2 although I haven't mentioned it in my post. The MDS plot shown is based on VST from DESeq2. Anyway I can't use VST or RLOG transformed data for DGE in DESeq2. So far, using my real experimental variables doesn't seem to help with the batch effect. Perhaps I should create some new variables (like rna yield h/m/l you mentioned). I tried sva and created a new surrogate variable, but that didn't help.
It looks like adding a factor for RNA yield >5 and <5 (2 levels) would assist, along with the factor for Ext Batch (4 levels). As you adjust more and more, though, you run the risk of getting criticised if you aimed to publish.
Thanks for the tips. Do you think it's possible to add rnayield as a continuous variable to the glm model rather than categorising it?
It's possible
but, for DESeq2's model design when you're inputting the raw counts, only factors are allowed. If you supply a continuous variable, it will convert it to factors. If you're doing modelling later on using the normalised counts and glm() or lm(), then, by all means, a continuous variable is fine.As you begin to normalise and adjust for covariates, I would just ask that you be very careful about the conclusions you draw from the results. If I was in the position, I would consult a Full Professor in Statistics (if possible) just for corroboration.
@Kevin, are you sure?? - As far as I recall, continuous variables are valid in DESeq2's GLM.
Just to tak onto the discussion, I agree that consulting with a statistician is always a good idea if you're unsure. Two pairs of eyes are better than one. An additional avenue of exploration is to voom transform your counts and use Limma's differential expression modelling approach with a contrast matrix.
I also second the notion not to use ComBat, as it's been shown to exaggerate effects of the conditions you supply in your design matrix.
Thanks for the catch, Andrew! It does appear to accept continuous variable. I'll also quote Michael Love: "If the variable of interest is continuous-valued, then the reported log2 fold change is per unit of change of that variable" (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
I was sure that in older versions it converted continuous variable to factors. I know that the developers have done an overhaul of the modelling system in DESeq2, though.