I have received RNA-seq (paired end) data from my sequencing facility.
Details are like this. This is human PBMC samples for 98 participants and two days day0 and day7 (corresponding to pre and post vaccination) - in total 196 samples. 54 out of 98 were high responders and 44 were low responders.
I had the data randomised for them based on both factors so that they can use it in batches of three as we were going to produce same data for metabolomics in three batches. I kept the biological variation same in three batches based on both factors and also the visits for same partcipants in same batch. Now theyt are telling me this:
The “run #” actually refers to the sequencing run for the samples and not associated with any other process.
“run 5” and “run 6” refers to the NovaSeq run – where we targeted >40M reads/samples.
Regarding the processing and batching, below are the details:
Extraction (multiple batches) –> Followed by library prep (2 batches of 95 samples + 8 samples in last batch) –> Followed by sequencing (~50 samples/lane - ~4 lanes total sequenced across 2 sequencing runs (run 5 and run 6))
Question 0: This means I have a run variable where run5 has ~50 samples and run6 has ~150 samples. and library prep had three batches of 95 samples in two batches and 8 samples in last batch.
I am not sure I can use combat seq for this imbalanced design at all? because assumption is similar biological variation across batches which wont be the case in both.Am I thinking right?
Question 1: What is best solution for moving forward on this scenario? I do know that I could incorporate the batch in DE analysis e.g. when I am using edger, deseq2 or limma but what would I do if I need to do other analysis like correlation , WGCNA?
Question 2: Also If I am using sva for finding out the other hidden batch effects ? Should I incorporate in sva?
a. my all biological variables of interest e.g. visit and response status? b. should I also control for know batch effects while trying to find out others e.g.
mod <- model.matrix(~ visit+response_status+batchvariable1+batchvariable2, colData(dds))
mod0 <- model.matrix(~1, colData(dds))
svseq <- svaseq(dat, mod = mod, mod0 = mod0, n.sv = 3)
Kevin Blighe pls could you help me here.
First of all, I would first try to measure/visualize if there is any real problem there with the batches. Probably with a PCA or something similar. 2) Have you checked that if there is a batch effect there is a causal path that could confound the answers you are looking for? If you decide it might you can remove the batch effect of the data instead of adjust for it.
Last, I want to promote a tool I designed to prevent this batch problems: https://experdesign.llrs.dev/. Specifically,
check_data(colData(dds))
will help detecting problems with your multiple conditions and variables of interest: Do all the 8 samples of the last batch belong to one condition? Have you some samples in several batches to adjust better for batch effect? The package also has some functions for planning how to batch the samples before the experiment is done.