I used DESeq2 to process RNA-seq data from different sources, and I found a strong batch effect when I plotted the PCA (the point shapes represent the 3 different batches; for example, ctr and PH.7d samples from different batches cluster apart):
I tried to remove it using the limma package as described here:
colData
    sample      condition    batch
1   100         PH.7d        1
..........
7   75          ctr          1
8   SRR5035380  hblast.10.5  2
..........
25  SRR5035397  hblast.18.5  2
26  SRR8437299  ctr          3
..........
37  SRR8437324  PH.7d        3
vsd<-vst(dds)
assay(vsd)<-limma::removeBatchEffect(assay(vsd),vsd$data1)
data2<-plotPCA(vsd, intgroup=c('condition','batch'),returnData=T)
data2<-as.data.frame(data2)
percentVar<- round(100*attr(data2,'percentVar'))
plot2<-qplot(PC1,PC2,color=condition,shape=batch,data=data2)
However, nothing changes when I plot the results:
What am I doing wrong?
I also tried to account for the batch effect by including batch in the DESeq2 design formula:
ddsB=DESeqDataSetFromMatrix(countData = countData,colData = colData, design = ~batch+condition)
I'm getting this error:
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
Can somebody help me solve this?
Thanks in advance!
Are you sure that vsd$data1 corresponds to the vector encoding the batch variable? It seems to me it should be vsd$batch.

It looks like batch 2 doesn't contain any of the groups found in batches 1 and 3, so it is not possible to correct for that batch. Are you sure there is at least one group in batch 2 that is also found in batch 1 or 3?
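
One way to check this overlap yourself is sketched below in base R. It cross-tabulates condition against batch and compares the rank of the ~batch+condition model matrix with its number of columns. Note that col_data is a hypothetical toy stand-in for your colData: it has one sample per batch/condition combination visible in the table you posted, so the counts are made up and only the pattern of empty cells is meant to match.

```r
# Toy stand-in for colData (hypothetical: one sample per batch/condition
# combination shown in the question; real sample counts will differ).
col_data <- data.frame(
  condition = factor(c("PH.7d", "ctr", "hblast.10.5", "hblast.18.5", "ctr", "PH.7d")),
  batch     = factor(c(1, 1, 2, 2, 3, 3))
)

# Cross-tabulate condition by batch. A batch whose conditions appear in no
# other batch cannot be separated from those conditions.
design_table <- table(col_data$condition, col_data$batch)
print(design_table)

# Build the same design as in the DESeq2 call and compare its rank with
# its number of columns; rank < columns means the matrix is not full rank.
mm <- model.matrix(~ batch + condition, data = col_data)
mm_rank <- qr(mm)$rank
full_rank <- mm_rank == ncol(mm)
cat("columns:", ncol(mm), "rank:", mm_rank, "full rank:", full_rank, "\n")
```

With this layout the batch 2 column equals the sum of the two hblast condition columns, which is the kind of linear dependence the checkFullRank error is complaining about. Running the same two checks on your real colData should show whether batch 2 shares any condition with the other batches.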