I am going to be doing RNA-seq in batches. The situation: each batch will contain different samples, will be run on the same machine with the same library kit (but not the same prep), and the batches will be separated by a few months. Do people ever use a reference sample in these cases to normalize for batch effects? For example, one could create a large set of aliquots of RNA from the exact same sample, or pool of samples, and monitor how gene values change across batches. (Let's assume for the sake of argument that degradation is not an issue and that we spread all samples across all lanes of the machine. The timescale I am thinking of is approximately one year, so maybe it is not appropriate to make this assumption?)
The paper below used control samples sequenced on two machines to control for platform: "Multi-platform analysis of 12 cancer types reveals molecular classification within and across tissues-of-origin". I haven't found many other papers that do this.
From the supplement:
We used a set of 19 colon samples that were sequenced on both platforms to estimate platform differences. A limitation of this approach is that the platform correction was restricted to the 16,116 (out of the 20,531 total) genes expressed in colon, defined as those with 3 or more reads. Upper quartile normalized RSEM data was log2 transformed.
Genes with a value of zero were set to the missing value after log2 transformation and genes were filtered if they had missing data in greater than 30% of samples. For the 19 colon samples sequenced on each platform, within each dataset the gene median were calculated. The difference between the GAII platform and the HiSeq platform was calculated and subtracted from the full set of GAII data. The corrected GAII set was merged with the HiSeq data set followed by gene median centering.
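To make the quoted procedure concrete, here is a minimal sketch of how I read it, in plain Python. The function names, the gene name and the toy values are all made up for illustration, and this is not the authors' code; in particular, where exactly the 30% missing-data filter is applied is my assumption.

```python
import math
from statistics import median

def log2_or_missing(value):
    """log2-transform a normalized value; zeros become missing (None)."""
    return math.log2(value) if value > 0 else None

def gene_median(values):
    """Median of the non-missing values, or None if every value is missing."""
    present = [v for v in values if v is not None]
    return median(present) if present else None

def too_many_missing(values, max_missing=0.30):
    """True if more than max_missing of the samples lack a value for this gene."""
    return sum(v is None for v in values) / len(values) > max_missing

def platform_offsets(ref_old, ref_new):
    """Per-gene offset (old platform minus new platform), estimated only from
    the reference samples that were run on both platforms."""
    offsets = {}
    for gene in ref_old.keys() & ref_new.keys():
        med_old, med_new = gene_median(ref_old[gene]), gene_median(ref_new[gene])
        if med_old is not None and med_new is not None:
            offsets[gene] = med_old - med_new
    return offsets

def subtract_offsets(data_old, offsets):
    """Subtract the offset from every old-platform value; genes with no
    estimated offset are dropped (the limitation noted in the supplement)."""
    return {gene: [None if v is None else v - offsets[gene] for v in vals]
            for gene, vals in data_old.items() if gene in offsets}

def merge_and_center(corrected_old, data_new):
    """Merge both platforms per gene, filter on missingness, median-center."""
    merged = {}
    for gene in corrected_old.keys() & data_new.keys():
        values = corrected_old[gene] + data_new[gene]
        center = gene_median(values)
        if center is None or too_many_missing(values):
            continue
        merged[gene] = [None if v is None else v - center for v in values]
    return merged

# Toy data (hypothetical): one gene, three shared reference samples per platform.
ref_gaii = {"GENE1": [log2_or_missing(x) for x in (8, 16, 32)]}
ref_hiseq = {"GENE1": [log2_or_missing(x) for x in (4, 8, 16)]}
offsets = platform_offsets(ref_gaii, ref_hiseq)

all_gaii = {"GENE1": [log2_or_missing(x) for x in (64, 0)]}
all_hiseq = {"GENE1": [log2_or_missing(x) for x in (8, 32)]}
corrected = subtract_offsets(all_gaii, offsets)
merged = merge_and_center(corrected, all_hiseq)
```

The same idea would carry over to batches instead of platforms: the offset is estimated only from the shared reference aliquots, so it does not depend on the biology of the other samples in each batch, but it is a single location shift per gene and only covers genes detected in the reference.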
Is this strategy a good or bad idea compared with other techniques for controlling batch effects? For example, spike-ins, which are mostly useful for QC and library normalization, or techniques like ComBat, which require good representation of your populations in each batch so that batch and the biology of interest are not confounded.
Any insight is useful.
edit: I will be sequencing clinical samples.
Hi Carlo, It's been several years since you posted this, but I'd like to ask you a question about normalizing two batches of samples that are perfectly confounded. The samples with treatment "A" were prepared and sequenced in a separate batch from the treatment "B" group. From what I've read, there seems to be no good way to correct for batch effects when batch is totally confounded with the variable of interest. Would it be advisable for us to resequence the treatment "A" group together with one of the samples from the treatment "B" group, so that we can use the new data for that "B" sample to normalize the old data from the treatment "B" group? Ultimately we would like to run differential expression analysis on the two groups, whose differences currently cannot be attributed to biology, sequencing batch, or preparation batch. Can you offer any suggestions, or do you think we will need to resequence both groups in one batch? Thank you in advance.
Yes, to perform sound differential expression analysis, at least one sample of "A" should be prepared/sequenced with at least one sample from "B". In theory, this will be sufficient to control for batch effect, although it is always best (but not always possible) to sequence both groups fully in one batch.
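One way to see why a shared sample matters is to look at the rank of the design matrix. The sketch below is only an illustration under simple assumptions (two batches, an additive model with intercept, treatment and batch columns; the sample layout and function names are hypothetical). With perfect confounding the treatment and batch columns are identical, so the two effects cannot be separated; adding a single "A" sample to the "B" batch makes the matrix full rank.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank by Gaussian elimination, using exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def design_matrix(samples):
    """samples: list of (treatment, batch) pairs.
    Columns: intercept, treatment is 'B', batch is 2."""
    return [[1, int(t == "B"), int(b == 2)] for t, b in samples]

# Perfect confounding: every A in batch 1, every B in batch 2.
confounded = [("A", 1)] * 3 + [("B", 2)] * 3
# Same layout plus one A sample prepared and sequenced with the B batch.
bridged = confounded + [("A", 2)]

rank_confounded = matrix_rank(design_matrix(confounded))  # 2 of 3 columns
rank_bridged = matrix_rank(design_matrix(bridged))        # 3 of 3 columns
```

Full rank only means the batch term can be estimated at all. With a single bridging sample that estimate rests on one observation, which is why sequencing both groups together remains the safer option.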