Hi all,
I have RNA-seq data (read counts) from 96 mouse primary tumors with 15 different genotypes. The 96 samples were sequenced on 10 different days; however, most samples with the same genotype were sequenced on the same day. I am afraid that if I batch-correct for sequencing day, I will also lose the biological differences that exist between genotypes. Any suggestions?
This is my script. After batch correction I see a lot of change in the PCA plot:
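library(DESeq2)
# 'all' is the raw count matrix (genes x samples); part of colData is shown below
dds <- DESeqDataSetFromMatrix(as.matrix(all), colData, design = ~ Batch)
vsd <- vst(dds, blind = FALSE)
plotPCA(vsd, intgroup = "Batch")  # PCA before batch removal
# Remove the sequencing-day effect from the transformed values
assay(vsd) <- limma::removeBatchEffect(assay(vsd), batch = vsd$Batch)
plotPCA(vsd, intgroup = "Batch")  # PCA after batch removal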
Part of colData:
    Genotype  condition  Batch
1   A         primary    2017-06-29
2   A         primary    2017-06-29
3   A         primary    2017-06-29
4   A         primary    2017-06-29
5   A         primary    2017-06-29
6   AK        primary    2017-11-09
7   AK        primary    2017-11-09
8   AK        primary    2017-11-09
9   AP        primary    2018-04-18
10  AP        primary    2018-04-18
11  AP        primary    2018-04-18
12  AKP       primary    2019-09-12
13  AKP       primary    2019-09-12
14  AKP       primary    2019-09-12
I also looked at these questions:
DESeq2, batch effect correction, multiple conditions
Batch effect problem DEG, DESseq2
But I am still not sure what I should do. I would really appreciate any help!
Many thanks swbarnes for your prompt reply! Yes, they are all primary tumors but with different genotypes.
Sorry, I don't get your question. I named the headers myself. What should they be?
You cannot make use of a column where every single sample has the same value. There is no point in it being there.
You cannot get rid of or account for batch effect in the dataset you posted, because it is deeply confounded with genotype. You can't make use of it, except as a guide to which genotype comparisons aren't confounded by batch, and which ones are.
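One way to see the confounding for yourself is to cross-tabulate genotype against date. The sketch below uses base R only, and the data frame is retyped from the 14 rows you posted (not your full 96 samples), so treat it as an illustration rather than a verdict on the whole dataset:

```r
# Excerpt of the colData posted in the question (14 of the 96 samples)
colData <- data.frame(
  Genotype = rep(c("A", "AK", "AP", "AKP"), times = c(5, 3, 3, 3)),
  Batch    = rep(c("2017-06-29", "2017-11-09", "2018-04-18", "2019-09-12"),
                 times = c(5, 3, 3, 3)),
  stringsAsFactors = TRUE
)

# Cross-tabulate genotype against sequencing date
tab <- table(colData$Genotype, colData$Batch)
print(tab)

# Number of distinct dates each genotype appears on
batches_per_genotype <- rowSums(tab > 0)
print(batches_per_genotype)

# With both terms in the model, the design matrix is not full rank
# when genotype and batch are completely confounded
mm <- model.matrix(~ Batch + Genotype, data = colData)
full_rank <- qr(mm)$rank == ncol(mm)
print(full_rank)
```

In this excerpt every genotype sits on exactly one date, so the design matrix is rank-deficient and no model can separate the two effects. Run the same table on all 96 samples: any pair of genotypes that shares a date can be compared without date being an issue.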
However, if 1) all the RNA was extracted on the same day, 2) all the libraries were prepped on the same day, and 3) the dates really are just the instrument run date, you can safely ignore that date, because running libraries on different days does not cause a batch effect.
If I ignore that date, is it correct to include only genotype in the design formula to account for its effect? Is this script correct for normalizing the data?
I really appreciate your time and help!
That command line doesn't normalize anything. Normalizing doesn't take your design into account at all. But ~ Genotype is the only design you should be using with that colData.
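To make both points concrete, here is a minimal sketch on simulated counts. The matrix `counts`, the sample names, and the three genotypes are made up for illustration; this is not your data. It assumes DESeq2 is installed, and it shows that the size factors are the same whatever the design is, followed by the `~ Genotype` version of your script:

```r
library(DESeq2)

set.seed(1)
n_genes <- 2000
samples <- paste0("S", 1:12)

# Hypothetical sample table: 3 genotypes, 4 samples each
colData <- data.frame(
  Genotype = factor(rep(c("A", "AK", "AP"), each = 4)),
  row.names = samples
)

# Simulated negative binomial counts (no real genotype effect built in)
mu   <- 2^runif(n_genes, 5, 12)
disp <- 0.05 + 4 / mu
counts <- matrix(
  rnbinom(n_genes * 12, mu = rep(mu, times = 12), size = rep(1 / disp, times = 12)),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), samples)
)

# Same counts, two different designs
dds_geno <- DESeqDataSetFromMatrix(counts, colData, design = ~ Genotype)
dds_null <- DESeqDataSetFromMatrix(counts, colData, design = ~ 1)

# Size-factor normalization depends only on the counts, not on the design
dds_geno <- estimateSizeFactors(dds_geno)
dds_null <- estimateSizeFactors(dds_null)
same_size_factors <- isTRUE(all.equal(sizeFactors(dds_geno), sizeFactors(dds_null)))
print(same_size_factors)

# Transform and get PCA coordinates, colored by genotype
vsd <- vst(dds_geno, blind = FALSE)
pca <- plotPCA(vsd, intgroup = "Genotype", returnData = TRUE)
head(pca)
```

With your own objects, that would mean swapping `~ Batch` for `~ Genotype` in `DESeqDataSetFromMatrix` and passing `"Genotype"` to `plotPCA`, without the `removeBatchEffect` step.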
Oh, the second line of my script was left out, sorry. I have edited my post. Thanks!