Hey all,
I am integrating several RNA-seq datasets, which I mapped with kallisto against the reference genome. I now have TPM values for all of them, filtered to keep only expressed genes.
My question is: when integrating the different datasets (belonging to different experiments), would you further normalize the whole dataset (e.g. log2 transform it and quantile normalize, or apply TMM, etc.), or would you go directly to the batch effect correction?
Thanks in advance
Hey @Kevin,
thank you for your answer. I've read these papers. My final aim is network analysis, which is why, once I import raw counts into DESeq2, I don't want to fit any model there: my idea is to use the batch-corrected dataset as input for another program for network building. That's why I preferred to log2 transform the TPM to make the data easier to handle, quantile normalize it to make the distributions comparable across samples, and then correct for batch effects to remove the unwanted variation coming from the different experiments (users, dates, etc.). Then I'll have the input I want for the following analysis. What do you think?
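A minimal base R sketch of the first two steps described above (log2 transform, then quantile normalization) might look like this. The matrix tpm is simulated placeholder data, the +1 pseudocount is an assumption, and quantile_normalize is a hypothetical helper written for illustration, not a function from any package. It breaks ties by order of appearance, whereas packaged implementations such as limma's normalizeQuantiles() may handle ties differently. Batch correction is left out here.

```r
# Simulated placeholder TPM matrix: genes in rows, samples in columns
set.seed(1)
tpm <- matrix(rexp(60, rate = 0.1), nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("sample", 1:6)))

# log2 transform with a pseudocount of 1 (assumed) so zeros stay finite
log_tpm <- log2(tpm + 1)

# Hypothetical helper: force every sample onto the same distribution
quantile_normalize <- function(mat) {
  ranks  <- apply(mat, 2, rank, ties.method = "first")  # per-sample ranks
  sorted <- apply(mat, 2, sort)                         # per-sample sorted values
  ref    <- rowMeans(sorted)                            # mean value at each rank
  out    <- apply(ranks, 2, function(r) ref[r])         # map ranks to reference
  dimnames(out) <- dimnames(mat)
  out
}

qn <- quantile_normalize(log_tpm)
```

After this step every column of qn contains the same set of values, only assigned to different genes.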
Hey, are you aiming to use WGCNA for network analysis, or something else?
From the DESeq2 objects, it's possible to extract raw, normalised, variance-stabilised, and regularised log-transformed counts, which should be sufficient(?). The normalised counts would hopefully be batch-corrected, as batch would be included in the design formula during normalisation.
Edit 18th June 2018:
If including batch as a covariate in the design formula, in order to correct the counts for downstream analysis like WGCNA, ensure that blind=FALSE is set when using the vst() or rlog() functions.
Hey Kevin, again very helpful. I was thinking of WGCNA. On their website they suggest correcting with ComBat, which is why your advice changes the plans.
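The blind=FALSE advice above can be sketched roughly as follows. The dataset is simulated with DESeq2's makeExampleDESeqDataSet(), and the batch labels and the design ~ batch + condition are assumptions for illustration. Note that, according to the DESeq2 vignette, vst() and rlog() use the design only when estimating dispersions and do not themselves remove batch variation from the transformed values. The vignette's suggestion for that is limma's removeBatchEffect() applied to the transformed matrix, which is what the last step shows.

```r
library(DESeq2)
library(limma)

# Simulated placeholder data: 5000 genes, 12 samples, two conditions
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 5000, m = 12)

# Assumed batch labels, balanced across the two conditions
dds$batch <- factor(rep(c("b1", "b2"), times = 6))
design(dds) <- ~ batch + condition

# Size-factor-normalised counts
dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)

# Variance-stabilised values; blind = FALSE lets the design inform
# the dispersion estimates
vsd <- vst(dds, blind = FALSE)

# Remove the batch term from the transformed values, protecting condition
mm <- model.matrix(~ condition, colData(vsd))
corrected <- removeBatchEffect(assay(vsd), batch = vsd$batch, design = mm)
```

The matrix corrected has genes in rows, so it would need to be transposed before being passed to WGCNA, which expects samples in rows.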
Yes, Steve Horvath worked in the lab where I was based in Boston - they use WGCNA extensively there. Based on the published manuscripts on batch correction (which we've both read), they state that ComBat and other similar methods are fine if the dataset is balanced.
I guess that what you should do first is see if there is indeed any batch effect. You can do PCA to visually check if the samples segregate based on sampling date, batch, etc. You can also correlate these parameters with the first 5 or 10 PC eigenvectors to see if any significant correlations exist (use cor.test() in R, I think; string-based factors will have to be converted to numeric values).
Hey Kevin, yes, that is indeed what I did, and I observed a batch effect, even if not a very strong one. The TPM normalization should have reduced it.
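The PCA check suggested above can be sketched in base R as follows. The expression matrix and the two-level batch factor are simulated, with an artificial shift added so that there is something to detect; the choice of Spearman correlation and of 5 PCs is an assumption. Coding a factor as numbers is only harmless for two levels: for a factor with more levels the numeric order is arbitrary, and a test such as kruskal.test() may be more appropriate.

```r
# Simulated placeholder data: 200 genes (rows) x 12 samples (columns)
set.seed(42)
n_genes <- 200
n_samples <- 12
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", seq_len(n_samples))))

# Assumed two-level batch factor, with an artificial shift in 50 genes
batch <- factor(rep(c("exp1", "exp2"), each = 6))
expr[1:50, batch == "exp2"] <- expr[1:50, batch == "exp2"] + 3

# PCA on samples: prcomp() expects samples in rows
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Correlate each of the first 5 PCs with the numerically coded batch
batch_num <- as.numeric(batch)
res <- t(sapply(seq_len(5), function(i) {
  ct <- cor.test(pca$x[, i], batch_num, method = "spearman", exact = FALSE)
  c(PC = i, rho = unname(ct$estimate), p.value = ct$p.value)
}))
```

Rows of res with a small p-value point to PCs that are associated with batch.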
I'll let you know! Thanks again, always crucially helpful :)