Question

Using an RNA-seq reference sample across batches

1

8.1 years ago

denalitastic ▴ 30

I am going to be doing RNA-seq in batches. This is the situation: each batch will have different samples, use the same machine, and use the same library kit (but not the same prep), and the batches will be separated by a few months. Do people ever use a reference sample in these instances to normalize for any batch effects? For example, create a large set of aliquots of RNA from the exact same sample or pool of samples and monitor how gene values change across batches. (Let's assume for the sake of argument that degradation is not an issue and that we spread all samples across all lanes of the machine. The timescale I am thinking of is approximately 1 year, so maybe it is not appropriate to make this assumption?) The paper below used control samples sequenced on two machines to control for platform: "Multi-platform analysis of 12 cancer types reveals molecular classification within and across tissues-of-origin". I haven't found many other papers that do this.

From the supplement:

We used a set of 19 colon samples that were sequenced on both platforms to estimate platform differences. A limitation of this approach is that the platform correction was restricted to the 16,116 (out of the 20,531 total) genes expressed in colon, defined as those with 3 or more reads. Upper quartile normalized RSEM data was log2 transformed.
Genes with a value of zero were set to the missing value after log2 transformation and genes were filtered if they had missing data in greater than 30% of samples. For the 19 colon samples sequenced on each platform, within each dataset the gene median were calculated. The difference between the GAII platform and the HiSeq platform was calculated and subtracted from the full set of GAII data. The corrected GAII set was merged with the HiSeq data set followed by gene median centering.
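
For illustration, here is a minimal Python sketch of the median-offset idea described in the excerpt above. It is not the authors' code: the function names and toy values are made up, and it leaves out the 30% missing-data filter and the final gene median centering. It only shows the core step of estimating a per-gene platform offset from shared reference samples and subtracting it from the older platform's data.

```python
import math
from statistics import median


def log2_or_missing(value):
    """log2-transform a normalized value; zeros become missing (None)."""
    return math.log2(value) if value > 0 else None


def to_log2(samples):
    """Apply log2_or_missing to every gene of every sample."""
    return {
        name: {gene: log2_or_missing(v) for gene, v in profile.items()}
        for name, profile in samples.items()
    }


def gene_medians(samples):
    """Per-gene median across samples, skipping missing values."""
    genes = {gene for profile in samples.values() for gene in profile}
    medians = {}
    for gene in genes:
        values = [p[gene] for p in samples.values() if p.get(gene) is not None]
        if values:
            medians[gene] = median(values)
    return medians


def platform_offsets(ref_old, ref_new):
    """Old-platform minus new-platform median, per gene, from shared references."""
    med_old, med_new = gene_medians(ref_old), gene_medians(ref_new)
    return {g: med_old[g] - med_new[g] for g in med_old if g in med_new}


def subtract_offsets(samples, offsets):
    """Subtract the per-gene offset; genes without an offset or value are dropped."""
    return {
        name: {g: v - offsets[g] for g, v in profile.items()
               if v is not None and g in offsets}
        for name, profile in samples.items()
    }


# Toy, made-up values standing in for normalized expression.
ref_gaii = to_log2({"r1": {"G1": 16, "G2": 8}, "r2": {"G1": 32, "G2": 0}})
ref_hiseq = to_log2({"r1": {"G1": 8, "G2": 8}, "r2": {"G1": 16, "G2": 4}})
offsets = platform_offsets(ref_gaii, ref_hiseq)

# A non-reference sample that was only run on the older platform.
other_gaii = to_log2({"s1": {"G1": 64, "G2": 16}})
corrected = subtract_offsets(other_gaii, offsets)
print(offsets, corrected)
```

Note that the offset is one number per gene, so this kind of correction assumes the platform (or batch) shift for a gene is the same in the reference samples as in the samples of interest.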

Is this strategy a good or bad idea compared with other techniques for controlling batch effects? For example, spike-ins, which are mostly just for QC and library normalization, or techniques like ComBat, which require good representation of your populations in each batch so that batch and the biology of interest are not confounded.

Any insight is useful.

edit: I will be sequencing clinical samples.

RNA-Seq batch-effect • 5.2k views

ADD COMMENT • link updated 9 months ago by Ram 44k • written 8.1 years ago by denalitastic ▴ 30

score 3 · Answer 1 · 2016-12-14

3

8.1 years ago

Carlo Yague 8.9k

Do people ever use a reference sample in these instances to normalize for any batch effects?

Yes, we do. I'm working in yeast so it might be a bit different, but every time we sequence new mutants or new conditions, we resequence the control condition (in duplicate). And I can tell you that the batch effect from day-to-day library preparation is real and that this control is really needed.

Is this strategy a good or bad idea, vs other techniques of controlling for batch effect? Let's say spike-ins?

Spike-ins are good for normalizing global effects. However, batch effects are often not global and can affect gene expression differently depending on gene length, GC content, transcript stability, etc.

Perhaps the best practice would be to use both spike-ins and a control condition. I don't know about ComBat though.
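
To make the "global vs. not global" point concrete, here is a small Python sketch with made-up numbers. The sample names, counts, and the simple "scale spike-in totals to their mean" rule are all assumptions for illustration, not a specific published normalization method. A single spike-in derived factor per sample fixes a gene whose shift matches the overall depth, but leaves a gene-specific shift in place.

```python
def spike_in_factors(spike_totals):
    """One scaling factor per sample so that spike-in totals match their mean."""
    mean_total = sum(spike_totals.values()) / len(spike_totals)
    return {sample: mean_total / total for sample, total in spike_totals.items()}


def apply_factors(counts, factors):
    """Multiply every gene in a sample by that sample's single factor."""
    return {
        sample: {gene: c * factors[sample] for gene, c in genes.items()}
        for sample, genes in counts.items()
    }


# The same control RNA run in two batches (toy values).
spike_totals = {"batch1_ctrl": 100, "batch2_ctrl": 200}
counts = {
    "batch1_ctrl": {"geneA": 10, "geneB": 10},
    # geneA doubles with depth; geneB carries an extra gene-specific shift.
    "batch2_ctrl": {"geneA": 20, "geneB": 40},
}

scaled = apply_factors(counts, spike_in_factors(spike_totals))
for sample, genes in scaled.items():
    print(sample, genes)
```

After scaling, geneA agrees between the two runs while geneB still differs, which is the kind of residual effect a resequenced control condition can reveal.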

ADD COMMENT • link 8.1 years ago by Carlo Yague 8.9k

0

Hi Carlo, it's been several years since you posted this, but I'd like to ask you a question about normalizing two batches of samples that are perfectly confounded. A group of samples with treatment "A" was prepared and sequenced in a separate batch from the treatment "B" group. From what I've read, there seems to be no good way to correct for batch effects when batch is completely confounded with the variable of interest. I'd like to know if it is advisable for us to resequence the treatment "A" group together with one of the samples from the treatment "B" group, so that we can use the new sample from the treatment "B" group to normalize the old data from the treatment "B" group. Ultimately we would like to run differential expression analysis on the two groups, whose differences currently cannot be attributed to biology, sequencing batch, or preparation batch alone. Can you offer any suggestions, or do you think we will need to resequence both groups in one batch? Thank you in advance.

ADD REPLY • link 4.3 years ago by gatollefson • 0

0

if it is advisable for us to resequence treatment "A" group with one of the samples present in treatment "B" group so that we can use the new sample from treatment "B" group to normalize the old data from samples in treatment "B" group.

Yes, to perform sound differential expression analysis, at least one sample of "A" should be prepared/sequenced with at least one sample from "B". In theory, this will be sufficient to control for batch effect, although it is always best (but not always possible) to sequence both groups fully in one batch.
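
As a rough illustration of why a shared batch helps, here is a small Python sketch that checks whether treatments and batches are "connected" through samples. It assumes an additive model of the form ~treatment + batch, where treatment differences can only be separated from batch differences if every treatment and batch is linked through at least one chain of shared samples. The function name and the toy sample lists are made up.

```python
from collections import defaultdict


def design_is_connected(samples):
    """Return True if all treatments and batches are linked through samples.

    `samples` is an iterable of (treatment, batch) pairs, one per library.
    """
    graph = defaultdict(set)
    for treatment, batch in samples:
        t_node, b_node = ("treatment", treatment), ("batch", batch)
        graph[t_node].add(b_node)
        graph[b_node].add(t_node)
    if not graph:
        return False
    # Depth-first walk from an arbitrary node; connected means we reach all nodes.
    start = next(iter(graph))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(graph)


# All of "A" in batch 1 and all of "B" in batch 2: perfectly confounded.
confounded = [("A", 1), ("A", 1), ("B", 2), ("B", 2)]

# Adding a new batch 3 that contains both an "A" and a "B" sample.
bridged = confounded + [("A", 3), ("B", 3)]

print(design_is_connected(confounded), design_is_connected(bridged))
```

A connected design only says the effects can be separated in principle. With a single bridging sample the batch estimate rests on very little data, which is why sequencing both groups together is still preferable when possible.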

ADD REPLY • link 4.3 years ago by Carlo Yague 8.9k

score 3 · Answer 2 · 2016-12-14

3

8.1 years ago

Michele Busby ★ 2.2k

I like Carlo's answer a lot.

One thing to consider is what effects you will be blocking for by adding a control. A control sample in every run is often a good idea and yes, we have seen it. There are some built into the designs of the GTEx experiments, for instance. But it is important to understand that they are of varying value depending on what independent variable you are controlling for (blocking).

You don't mention where you are getting your samples. This is important in understanding whether you will be blocking for technical sequencing effects or biological effects.

In Carlo's yeast experiments, I suspect that he may be growing up the yeast in flasks where he will have different batch effects because of the biological variability of the yeast in flasks, e.g. how close they are to a heat source, variability in the feeding schedules, etc. This variability can be really big (e.g. Busby et al. 2011). So growing up a standard yeast with the other yeast will control for that.

However, if you are getting e.g. human samples from a clinic, sequencing something such as K562 with every batch will only give you clues about the variability in the sequencing run. This is useful and we do it for large experiments with dozens of precious samples in a single run because a low quality sequencing run is a disaster.

BUT the variability introduced by library prep + sequencing is usually low compared with the variability introduced by the quality of sample handling and by differences in the biology, which are usually big.

So in these cases a control for sequencing may not add that much information beyond controlling for disasters. So then you have to weigh the costs of the control vs the cost of adding another replicate (though this sounds like a good size experiment - yay!).

We have also used spike ins, such as ERCCs. These are good if you expect different levels of complexity across the samples (e.g. because of degradation). They also give additional information to help with normalizing samples against one another.

If your samples are going to be degraded, UMIs might be worth looking at too.

None of these things makes analysis simple; it is always hard in big experiments. But they all add information. They cost money in the wet lab, but remember that they save money in the bioinformatics analysis. This is the better place to spend money (though it puts us out of work!).

ADD COMMENT • link 8.1 years ago by Michele Busby ★ 2.2k

1

Good job putting things in perspective.

In Carlo's yeast experiments, I suspect that he may be growing up the yeast in flasks where he will have different batch effects because of the biological variability of the yeast in flasks

You are right to point this out. What we call "biological replicates" in yeast are basically clones. This is very different from replicates in medical research, where the variability between subjects can be much higher than the variability introduced by batch effects. But I feel that even if there are other, greater sources of variability, one should always control for batch effect somehow.

ADD REPLY • link 8.1 years ago by Carlo Yague 8.9k

1

We should also add that the best thing you can do is block your independent variables in the batch.

i.e. don't sequence all your cancers in one batch and all your normals in another batch (or whatever your tests and controls are). Put some of each condition in each batch if possible.
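
A minimal Python sketch of this kind of blocked assignment, assuming the condition of each sample is known before sequencing (the function name, sample IDs, and seed are made up). Samples are shuffled within each condition and then dealt to batches in turn, so each batch gets a mix of conditions.

```python
import random
from collections import Counter, defaultdict


def assign_batches(sample_conditions, n_batches, seed=0):
    """Spread each condition as evenly as possible across batches.

    `sample_conditions` maps sample ID -> condition label.
    Returns a dict mapping sample ID -> batch index (0-based).
    """
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    by_condition = defaultdict(list)
    for sample, condition in sample_conditions.items():
        by_condition[condition].append(sample)

    assignment = {}
    slot = 0  # running counter keeps total batch sizes balanced too
    for condition in sorted(by_condition):
        members = sorted(by_condition[condition])
        rng.shuffle(members)  # randomize which sample lands in which batch
        for sample in members:
            assignment[sample] = slot % n_batches
            slot += 1
    return assignment


samples = {f"cancer_{i}": "cancer" for i in range(4)}
samples.update({f"normal_{i}": "normal" for i in range(4)})

layout = assign_batches(samples, n_batches=2)
per_batch = Counter((batch, samples[s]) for s, batch in layout.items())
print(per_batch)
```

With clinical samples that arrive over time, the labels may not be known up front, so this is only applicable when samples can be held and batched after their group is known.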

ADD REPLY • link 8.1 years ago by Michele Busby ★ 2.2k

0

I am sequencing clinical samples, so in my mind it would be more about controlling for run variability. A few more points: in a sense I do not know what the cases/controls are a priori. I do have 4 subpopulations, and for the sake of argument we can say they have relatively similar prevalences, i.e. when I get a sample from a clinician there is a ~25% chance of it coming from each group. The reason I ask this question is that I have already run a small set of samples in 2 batches. When I looked at them with a PCA of RNA expression levels, there seemed to be a slight batch effect on PC2 (not 100% separation). Since the sets are small it could be biology, but it got me thinking about what I would do to mitigate this in a worse situation. As far as blocking goes, my best bet is to sequence as many as I can together to get a representative swath so that, as you suggested Michele, my biology and batch are not confounded. This is also helpful/necessary for batch correction algorithms such as ComBat.

My next question is what the next bioinformatics step would be with the extra sequenced reference samples. Is the median subtraction method used by the Stuart group in the paper I linked a good strategy? Or are there other, better methods and algorithms that you use? And do you think it is necessary to use duplicate reference samples, Carlo? I greatly appreciate the help you guys have provided.

ADD REPLY • link 8.1 years ago by denalitastic ▴ 30

0

Is the median subtraction method used by the Stuart group in the paper I linked a good strategy?

I'm not sure how good this strategy is. Perhaps Michele or others can tell you more about it. In my case I use DESeq2 for downstream analysis and batch effect can be taken into account simply by including it in the model. More complex batch corrections can be made using sva + DESeq2.

And do you think it is necessary to use duplicate reference samples?

In your case, probably not. Especially if you have many batches.

ADD REPLY • link 8.1 years ago by Carlo Yague 8.9k

0

Let's say I use multiple reference samples across the batches. In my limma or ComBat model, is it then ~ treatment + technical_replicate_or_not + batch?

ADD REPLY • link 8.1 years ago by denalitastic ▴ 30

0

IMO, "technical_replicate_or_not" is not a factor, so the model is only ~treatment + batch. Your technical replicates will have exactly the same levels, that's why they are replicates :)
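
To show what "exactly the same levels" means in practice, here is a small Python sketch that builds a dummy-coded design matrix for ~treatment + batch by hand. It is only an approximation of what R's model.matrix would produce with default treatment contrasts. The column names, sample names, and the choice to give the reference aliquots their own treatment level ("reference") are assumptions for illustration.

```python
def design_matrix(samples, factors):
    """Dummy-coded design with an intercept.

    For each factor, the first level in sorted order is the baseline and
    every other level gets its own 0/1 indicator column.
    """
    columns = ["Intercept"]
    levels = {}
    for factor in factors:
        levels[factor] = sorted({s[factor] for s in samples})
        columns += [f"{factor}_{lvl}" for lvl in levels[factor][1:]]

    rows = []
    for s in samples:
        row = [1]
        for factor in factors:
            row += [1 if s[factor] == lvl else 0 for lvl in levels[factor][1:]]
        rows.append(row)
    return columns, rows


samples = [
    {"name": "ref_b1", "treatment": "reference", "batch": "b1"},
    {"name": "ref_b2", "treatment": "reference", "batch": "b2"},
    {"name": "p1", "treatment": "groupA", "batch": "b1"},
    {"name": "p2", "treatment": "groupB", "batch": "b1"},
    {"name": "p3", "treatment": "groupA", "batch": "b2"},
    {"name": "p4", "treatment": "groupB", "batch": "b2"},
]

columns, rows = design_matrix(samples, ["treatment", "batch"])
print(columns)
for s, row in zip(samples, rows):
    print(s["name"], row)
```

The two reference aliquots end up with identical treatment columns and differ only in the batch column, so no separate "replicate or not" term is needed for the model to use them when estimating the batch effect.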

ADD REPLY • link 8.1 years ago by Carlo Yague 8.9k