Question

RNA-Seq Normalization and Batch Correction

7.4 years ago

Kristin Muench ▴ 640

Hello,

I have a few questions on the topic of batch correction.

The pipeline for this data is currently: TopHat2 (alignment to genome reference) > htseq-count (count reads per gene) > DESeq2 in R

I would really appreciate any thoughts you can offer, or tips on how your lab does things!

My questions are:

How can you perform batch correction with sequencing day as a variable when not all samples were re-sequenced on both days?

Backstory: We had two batches of sequencing (Day 1 and Day 2). A few, but not all, of the libraries sequenced on Day 1 were re-sequenced on Day 2. Any suggestions for how I should perform batch correction for Day 1 vs. Day 2 on this data? Is including "SequencingDay" in the design of my DESeq2 object sufficient, or should I rely on SVA, or something else?
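To make the layout concrete, here is a minimal sketch of the kind of sample sheet I mean, with one row per sequencing run. The library names, conditions, and day labels are invented for illustration and are not my real data. It lists which libraries were sequenced on both days and cross-tabulates condition against sequencing day, which is one quick way to see how unevenly the days are covered.

```python
from collections import Counter

# Hypothetical sample sheet: one row per sequencing run, not per library.
# All names and labels below are made up for illustration.
runs = [
    {"library": "lib1", "condition": "control", "day": "Day1"},
    {"library": "lib1", "condition": "control", "day": "Day2"},
    {"library": "lib2", "condition": "treated", "day": "Day1"},
    {"library": "lib2", "condition": "treated", "day": "Day2"},
    {"library": "lib3", "condition": "control", "day": "Day1"},
    {"library": "lib4", "condition": "treated", "day": "Day1"},
]


def days_per_library(runs):
    """Map each library to the sorted list of days it was sequenced on."""
    days = {}
    for run in runs:
        days.setdefault(run["library"], set()).add(run["day"])
    return {lib: sorted(d) for lib, d in days.items()}


def crosstab(runs, row_key, col_key):
    """Count runs for every (row_key, col_key) pair that occurs."""
    return Counter((run[row_key], run[col_key]) for run in runs)


# Libraries that appear on more than one sequencing day.
resequenced = [lib for lib, d in days_per_library(runs).items() if len(d) > 1]
table = crosstab(runs, "condition", "day")

print("Re-sequenced libraries:", resequenced)
for (condition, day), n in sorted(table.items()):
    print(condition, day, n)
```

In this toy layout only the re-sequenced libraries carry any Day 2 information, which is exactly the situation I am unsure how to model.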

Related to the above: how can you perform batch effect correction when you have multiple batch effects and not all possible combinations of the variables are represented in the samples?

Backstory: We have some batch effects that unfortunately aren't distributed evenly across all samples - e.g., of 10 samples, say we have suspected effects of Genotype, Sex, and Treatment - but don't have an example of a (Genotype1 + Male + Treatment2) sample. Is the solution basically 'pick your favorite batch effect' and only correct for that? So, in the example above, make your design matrix Genotype + Treatment, and ignore the effect of Sex? Is there a better way?
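One way to make this question concrete is to check whether the design matrix for a given model still has full column rank when a combination is missing. The sketch below is plain Python, not DESeq2 itself, and the sample labels are hypothetical. It builds an intercept-plus-dummy-variable matrix for an additive model and computes its rank by Gaussian elimination. As far as I understand, a purely additive model does not need every combination to be present; problems arise when one variable is fully determined by another, or when interaction terms are requested for combinations that were never observed.

```python
from fractions import Fraction


def design_matrix(samples, factors):
    """Intercept + dummy-coded columns; the first sorted level is the reference."""
    levels = {f: sorted({s[f] for s in samples}) for f in factors}
    columns = ["Intercept"]
    for f in factors:
        columns += [f"{f}_{lv}" for lv in levels[f][1:]]
    rows = []
    for s in samples:
        row = [1]
        for f in factors:
            row += [1 if s[f] == lv else 0 for lv in levels[f][1:]]
        rows.append(row)
    return columns, rows


def matrix_rank(rows):
    """Rank via exact Gaussian elimination on fractions (no rounding issues)."""
    m = [[Fraction(v) for v in r] for r in rows]
    rank = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(rank, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(len(m)):
            if i != rank and m[i][c] != 0:
                k = m[i][c] / m[rank][c]
                m[i] = [a - k * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank


def is_full_rank(samples, factors):
    columns, rows = design_matrix(samples, factors)
    return matrix_rank(rows) == len(columns)


# Hypothetical samples: every combination except (G1, M, T2) is present.
combos = [("G1", "M", "T1"), ("G1", "F", "T1"), ("G1", "F", "T2"),
          ("G2", "M", "T1"), ("G2", "M", "T2"), ("G2", "F", "T1"),
          ("G2", "F", "T2")]
samples = [{"Genotype": g, "Sex": s, "Treatment": t} for g, s, t in combos]

# Hypothetical fully confounded case: each genotype was run on its own day.
confounded = [{"Genotype": g, "Day": d}
              for g, d in [("G1", "D1"), ("G1", "D1"), ("G2", "D2"), ("G2", "D2")]]

print("additive, one combination missing:",
      is_full_rank(samples, ["Genotype", "Sex", "Treatment"]))
print("genotype confounded with day:",
      is_full_rank(confounded, ["Genotype", "Day"]))
```

The first check passes and the second fails, so in this toy case the missing combination alone would not force me to drop Sex from an additive design. Whether that holds for my real sample table is part of what I am asking.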

How can I integrate spike-ins into my analysis pipeline?

In the alignment step, how do I make sure spike-ins are represented in the output file (i.e. gene counts), if there isn't a special version of the genome reference file? Does merely including spike-ins in our input data boost the accuracy of DESeq2's normalization algorithm enough? Or is there some other layer of normalization I should do when using spike-ins?
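To show what I mean by "another layer of normalization", here is a minimal sketch of median-of-ratios size factors computed either from all genes or from a chosen subset such as the spike-ins. It is a plain Python illustration of the idea, not DESeq2's code, and the gene IDs and counts are invented. As far as I understand, DESeq2's `estimateSizeFactors` has a `controlGenes` argument that restricts the estimate in a similar way, but I am not sure whether using it with spike-ins is advisable.

```python
import math
from statistics import median


def size_factors(counts, control_genes=None):
    """Median-of-ratios size factors, one per sample.

    counts: dict mapping gene ID -> list of raw counts (one per sample).
    control_genes: optional iterable of gene IDs (e.g. spike-ins) to
    restrict the estimate to; None means use every gene.
    """
    if control_genes is None:
        genes = list(counts)
    else:
        genes = [g for g in control_genes if g in counts]
    # A gene with a zero in any sample has no finite log geometric mean.
    usable = [g for g in genes if all(c > 0 for c in counts[g])]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    n_samples = len(counts[usable[0]])
    log_geo_mean = {
        g: sum(math.log(c) for c in counts[g]) / n_samples for g in usable
    }
    return [
        math.exp(median(math.log(counts[g][j]) - log_geo_mean[g] for g in usable))
        for j in range(n_samples)
    ]


# Hypothetical counts for two samples; IDs and values are illustrative only.
counts = {
    "ERCC-00002": [100, 200],
    "ERCC-00003": [50, 100],
    "ERCC-00004": [10, 20],
    "GeneA": [500, 500],
    "GeneB": [30, 90],
    "GeneC": [0, 5],
}
spike_ins = [g for g in counts if g.startswith("ERCC-")]

sf_all = size_factors(counts)
sf_spike = size_factors(counts, control_genes=spike_ins)
print("all genes:     ", sf_all)
print("spike-ins only:", sf_spike)
```

Dividing each sample's counts by its size factor gives the normalized counts. The open question for me is whether the spike-in-only estimate should replace the default one or just serve as a sanity check.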

RNA-Seq R • 3.4k views
