Hi All,
I have 2 experimental condition and each condition has 18 samples. However, they are sequenced in 3 batches. 1 st batch 6 pair (disease & control), 2nd batch 6 pair (disease & control), 3rd batch 6 pair (disease & control).
Each batch done by different people over different times so I believe we have batch effect. As I am using DESeq2 for my analysis I set my model design as “Batch+Type”. I used removeBatchEffect from limma package and plotted PCA & did hierarchical clustering using spermann correlation to see which samples are outliers.(I attached the graphs below). I also used an alternative PCA &plotting (prcomp()
& fviz_pca_ind
) which actually confused me. The ggplot & fviz_pca_ind graphs look different from each other. Which one should I trust?
Another issue I am facing is 1st & 2nd batches are 76 bp paired end, however the last batch is 101 bp paired end.I did not add this into the design separately. Should I add it to the model design ? If so, how should I desing ? or Should I trim all fastq files to 76 bp to have same size ? What do you suggest?
I would like to do differential gene expression analysis and WGCA. So Should I remove any samples according to these PCA plots or just do analysis as I already included batch effect into the model design as "~Batch+Type"
I really appreciate if you give me some insight,
Thanks in advance,
DESeq2
ddsMat.pre<- DESeqDataSetFromMatrix(seMat.pre, sample.table.ERSvsPT.pre, design=~Batch+Type)
dds.pre<-DESeq(ddsMat.pre)
dds.pre <- dds.pre[ rowSums(counts(dds.pre)) > 1, ]
Normalization
rld.batch <- rlog(dds.pre, blind= FALSE)
rlogMat.batch <- assay(rld.batch)
Removing batch effect
design=model.matrix(~sample.table.ERSvsPT.pre$Type)
removedbatch=removeBatchEffect(rlogMat.batch, batch=sample.table.ERSvsPT.pre$Batch, design= design )
rld.removed<-rld.batch
assay(rld.removed)<-removedbatch
Plotting the PCA
before removing batch effect:
data.batch<-plotPCA(rld.batch, intgroup=c("Batch","Type"), returnData=TRUE)
data.batch$name<-colData(rld.batch)$Patient
percentVar.batch <- round(100 * attr(data.batch, "percentVar"))
ggplot(data.batch, aes(PC1, PC2, color=Batch, shape=Type, label= name)) + geom_text(size=6,hjust = 0, nudge_x = 0.05) + geom_point(size=2) + xlab(paste0("PC1: ",percentVar.batch[1],"% variance")) + ylab(paste0("PC2: ",percentVar.batch[2],"% variance")) + ggtitle("PCA of RLD_Type+Batch ") + coord_fixed()
after removing batch effect:
data.removed<-plotPCA(rld.removed, intgroup=c("Batch","Type"), returnData=TRUE)
data.removed$name<-colData(rld.removed)$Patient
percentVar.removed <- round(100 * attr(data.removed, "percentVar"))
ggplot(data.removed, aes(PC1, PC2, color=Batch, shape=Type, label= name)) + geom_text(size=6,hjust = 0, nudge_x = 0.05) + geom_point(size=3) + xlab(paste0("PC1: ",percentVar.removed[1],"% variance")) + ylab(paste0("PC2: ",percentVar.removed[2],"% variance")) + ggtitle("PCA after removeBatchEffect ") + coord_fixed()
Alternative PCA method : prcomp() & fviz_pca_ind
before.pca <- prcomp(t(rlogMat.batch), scale=T)
after.pca <- prcomp(t(removedbatch), scale=T)
quali.sup <- as.factor(colData(rld.readlength)$Type)
fviz_pca_ind(before.pca,habillage = quali.sup, addEllipses = TRUE, ellipse.level = 0.75) + theme_minimal()
fviz_pca_ind(after.pca,habillage = quali.sup, addEllipses = TRUE, ellipse.level = 0.68) + theme_minimal()
Before removing the batch effect:
After removing the batch effect:
Alternative PCA plot for Before Batch effect removal:
Alternative PCA plot for After Batch effect removal:
I really appreciate your insights. Thanks a lot Carlos.
So I do not need to remove the P5 &P6 which belong to the left hand side group in first two graphs.
I don't see a reason to remove them. Don't be too scared by the vertical distance between P5/P6 and the others points : the y-axis represents only 8% of total variance.
In general for WGCNA you should adjust for batch effects prior to performing the analysis (input should be a matrix containing normalized expression). In addition it is recommended to remove low count "genes" and apply a variance stabilizing transformation when using RNAseq as input (from WGCNA FAQ page: https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/faq.html).