Question

Integrating RNA-Seq datasets from different experiments

7.2 years ago

lessismore ★ 1.4k

Hey all,

I am integrating RNA-seq datasets, which I mapped with kallisto against the reference genome. I now have TPM values for all of them, filtered to keep only expressed genes.

My question is: when integrating the different datasets (which come from different experiments), would you further normalize the whole dataset (e.g. log2-transform and quantile-normalize it, or apply TMM, etc.), or would you go directly to batch effect correction?

Thanks in advance

RNA-Seq normalization • 4.9k views

ADD COMMENT • link updated 7.2 years ago by Kevin Blighe 88k • written 7.2 years ago by lessismore ★ 1.4k

Kevin Blighe · Answer 1 · 2017-10-10

7.2 years ago

Kevin Blighe 88k

Hey,

I would do neither of the above. Instead, I would import the kallisto estimated counts for all samples into DESeq2 using tximport, and then include batch as a factor in the DESeq2 design formula. For a tutorial from Michael Love and colleagues, see: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html

I would not advise attempting to directly correct for batch on your raw or TPM counts. It is better to include batch as a covariate or blocking factor in your statistical models. These are recommendations HERE and HERE by statisticians working in the field of expression data normalisation and batch correction. Others have other opinions though, as always.

Good luck!

Kevin

ADD COMMENT • link 6.5 years ago by Kevin Blighe 88k

Hey @Kevin,

Thank you for your answer. I've read these papers. My final aim is network analysis, which is why, once I import the raw counts into DESeq2, I don't want to fit any model there: my idea is to use the batch-corrected dataset as input for another program for network building. That is why I preferred to log2-transform the TPM values to make the data easier to handle, quantile-normalize them to make the distributions uniform, and then correct for batch effects to remove the unwanted variation coming from the different experiments (users, dates, etc.). That would give me the input I want for the downstream analysis. What do you think?
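To make the first two steps of that plan concrete, here is a minimal sketch of a log2 transform followed by quantile normalization. It is written in plain Python purely for illustration (the thread itself works in R), and the function names, the pseudocount of 1, and the toy TPM matrix are all assumptions rather than anything from the original posts. It does not perform batch correction.

```python
import math

def log2_transform(matrix, pseudocount=1.0):
    """Return log2(value + pseudocount); rows are genes, columns are samples.

    The pseudocount of 1 is an assumed, commonly used choice to avoid log2(0).
    """
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]

def quantile_normalize(matrix):
    """Give every sample (column) the same distribution of values.

    Each value is replaced by the mean, across samples, of the values that
    hold the same rank. Ties are broken by row order, which is a
    simplification of what dedicated packages do.
    """
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the r-th smallest value across all samples.
    rank_means = [sum(col[r] for col in sorted_cols) / n_samples
                  for r in range(n_genes)]
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])  # genes by rank
        for rank, g in enumerate(order):
            normalized[g][s] = rank_means[rank]
    return normalized

# Toy TPM matrix with made-up numbers: 4 genes x 3 samples.
tpm = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
log_tpm = log2_transform(tpm)
qn = quantile_normalize(log_tpm)
```

After this step every column of `qn` contains the same set of values, only assigned to different genes, which is what "making the distributions uniform" amounts to.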

ADD REPLY • link 7.2 years ago by lessismore ★ 1.4k

Hey, are you aiming to use WGCNA for network analysis, or something else?

From the DESeq2 objects, it's possible to extract raw, normalised, variance-stabilised, and regularised log-transformed counts, which should be sufficient(?). The normalised counts would hopefully be batch-corrected, as batch would be included in the design formula during normalisation.

Edit 18th June 2018:

If including batch as a covariate in the design formula in order to correct the counts for downstream analysis like WGCNA, ensure that blind=FALSE is set when using the vst() or rlog() functions.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

Hey Kevin, again very helpful. I was thinking of WGCNA. Their website suggests correcting with ComBat, which is why your advice changes my plans.

ADD REPLY • link 7.2 years ago by lessismore ★ 1.4k

Yes, Steve Horvath worked in the lab where I was based in Boston - they use WGCNA extensively there. The published manuscripts on batch correction (which we've both read) state that ComBat and other similar methods are fine if the dataset is balanced.
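One way to check the "balanced" condition is to cross-tabulate batch against biological condition. The sketch below is an illustration in plain Python, not part of the original posts; the function name, the sample labels, and the working definition of balance (each condition makes up the same proportion of every batch) are assumptions.

```python
from collections import Counter

def is_balanced(batches, conditions):
    """Return True if each condition makes up the same share of every batch.

    This is an assumed working definition: the batch-by-condition count
    table must equal the table expected under perfect independence.
    """
    table = Counter(zip(batches, conditions))
    batch_totals = Counter(batches)
    cond_totals = Counter(conditions)
    n = len(batches)
    # Integer comparison avoids floating-point rounding issues.
    return all(table[(b, c)] * n == batch_totals[b] * cond_totals[c]
               for b in batch_totals for c in cond_totals)

# Hypothetical designs with 8 samples each.
batches = ["A", "A", "A", "A", "B", "B", "B", "B"]
balanced_design = ["ctrl", "ctrl", "trt", "trt", "ctrl", "ctrl", "trt", "trt"]
confounded_design = ["ctrl", "ctrl", "ctrl", "ctrl", "trt", "trt", "trt", "trt"]

balanced_ok = is_balanced(batches, balanced_design)
confounded_ok = is_balanced(batches, confounded_design)
```

In the confounded design each batch contains only one condition, so batch and biology cannot be separated, which is the situation the balance caveat is meant to flag.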

I guess that what you should do is first see if there is indeed any batch effect. You can do PCA to visually check if the samples segregate based on sampling date, batch, etc. You can also correlate these parameters to the first 5 or 10 PC eigenvectors to see if any significant correlations exist (use cor.test() in R I think - String-based factors will have to be converted to numerical factors).
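As a rough illustration of that check, the sketch below computes sample scores on the first principal component and correlates them with a numerically coded batch label. It is plain Python rather than R, the expression values and batch labels are made up, and it reports only the correlation coefficient, not the p-value that cor.test() in R would also give.

```python
import math
import random

def first_pc_scores(matrix, n_iter=200):
    """Sample scores on PC1 (up to sign and scale), via power iteration.

    Rows are genes and columns are samples; each gene is mean-centred first.
    """
    n_samples = len(matrix[0])
    centered = []
    for row in matrix:
        mean = sum(row) / n_samples
        centered.append([v - mean for v in row])
    # Sample-by-sample Gram matrix of the gene-centred data.
    gram = [[sum(row[i] * row[j] for row in centered) for j in range(n_samples)]
            for i in range(n_samples)]
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    vec = [rng.uniform(-1, 1) for _ in range(n_samples)]
    for _ in range(n_iter):
        vec = [sum(gram[i][j] * vec[j] for j in range(n_samples))
               for i in range(n_samples)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up log-scale expression: 4 genes x 6 samples, two hypothetical batches.
batch = ["b1", "b1", "b1", "b2", "b2", "b2"]
batch_code = [0 if b == "b1" else 1 for b in batch]  # string factor -> numeric
expr = [
    [5.0, 5.2, 4.9, 7.1, 7.0, 6.8],
    [3.1, 2.9, 3.0, 5.0, 5.2, 4.9],
    [8.0, 8.1, 7.9, 8.2, 7.8, 8.0],
    [2.0, 2.2, 1.9, 4.1, 3.9, 4.0],
]
pc1 = first_pc_scores(expr)
r = pearson(pc1, batch_code)
```

A large absolute value of `r` suggests that PC1 tracks batch. For a factor with more than two levels, numeric coding imposes an arbitrary order, so an ANOVA-style comparison of PC scores across levels may be more appropriate.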

ADD REPLY • link 6.1 years ago by Kevin Blighe 88k

> I guess that what you should do is first see if there is indeed any batch effect. You can do PCA to visually check if the samples segregate based on sampling date, batch, etc.

Hey Kevin, yes, that is indeed what I did, and I observed a batch effect, although not a strong one. The TPM normalization should have reduced it.

> You can also correlate these parameters to the first 5 or 10 PC eigenvectors to see if any significant correlations exist (use cor.test() in R I think - String-based factors will have to be converted to numerical factors).

I'll let you know! Thanks again, always crucially helpful :)

ADD REPLY • link updated 6.1 years ago by Kevin Blighe 88k • written 7.2 years ago by lessismore ★ 1.4k