Hi All,
I have 2 experimental condition and each condition has 18 samples. However, they are sequenced in 3 batches. 1 st batch 6 pair (disease & control), 2nd batch 6 pair (disease & control), 3rd batch 6 pair (disease & control).
Each batch done by different people over different times so I believe we have batch effect. As I am using DESeq2 for my analysis I set my model design as “Batch+Type”. I used removeBatchEffect from limma package and plotted PCA & did hierarchical clustering using spermann correlation (I believe it is better to use Spearman rather than the Pearson??) to see which samples are outliers. Another issue I am facing is 1st & 2nd batches are 76 bp paired end, however the last batch is 101 bp paired end.I did not add this into the design separately.
I would like to do differential gene expression analysis and WGCA. So I have multiple questions:
- Should I use
rlog()
for differential gene expression analysis andgetVarianceStabilizedData()
for WGCNA? or can I use same normalization for both? If yes, which one do you suggest ? - Should I remove any samples according to these PCA plots & dendogram or just do analysis as I already included batch effect into the model design as
~Batch+Type
. Or is it like for DGE analysis I do not to remove any sample but for WGCNA I should remove?
For both normalization type it looks like I have to remove the Patient-Sph4 as it look has distance from others in PCA plot also (the one at the middletop)
I really appreciate if you give me some insight,
Thanks in advance,
ddsMat.pre<- DESeqDataSetFromMatrix(seMat.pre, sample.table.ERSvsPT.pre, design=~Batch+Type)
dds.pre<-DESeq(ddsMat.pre)
dds.pre <- dds.pre[ rowSums(counts(dds.pre)) > 1, ]
datExpr0<- assay(dds.pre)
gsg = goodSamplesGenes(datExpr0, verbose = 5);
gsg$allOK
RLD
rld <- rlog(dds.pre, blind= FALSE)
rlogMat <- assay(rld)
save(rld, file="RLD_Pre_Type+Batch")
design=model.matrix(~sample.table.ERSvsPT.pre$Type)
removedbatch.rld=removeBatchEffect(rlogMat, batch=sample.table.ERSvsPT.pre$Batch, design= design )
write.csv(removedbatch.rld, file="removedbatch.rld_36patient")
rld.removed<-rld
assay(rld.removed)<-removedbatch.rld
par(mfrow = c(1,2));
cex1 = 1.5;
plotPCA(rld.batch, intgroup=c("Type", "Batch"))
plotPCA(rld.removed, intgroup=c("Type", "Batch"))
colData(rld.removed)$Type<-as.factor(colData(rld.removed)$Type)
colData(rld.removed)$Batch<-as.factor(colData(rld.removed)$Batch)
data.removed<-plotPCA(rld.removed, intgroup=c("Batch","Type"), returnData=TRUE)
data.removed$name<-colData(rld.removed)$Patient
percentVar.removed <- round(100 * attr(data.removed, "percentVar"))
p2<-ggplot(data.removed, aes(PC1, PC2, color=Batch, shape=Type, label= name)) + geom_text(size=5,hjust = 0, nudge_x = 0.05) + geom_point(size=3) + xlab(paste0("PC1: ",percentVar.removed[1],"% variance")) +
ylab(paste0("PC2: ",percentVar.removed[2],"% variance")) + ggtitle("PCA of RLD_pre_Type+Batch ") + coord_fixed()
mat.rld.removed<- t(assay(rld.removed))
rownames(mat.rld.removed) <- paste( rld.removed$Written.ID)
plot(hcluster(dist(mat.rld.removed), method= "spearman", link="average" ), cex=1.5, cex.main=2, main= "Hierarchical Clustering of RLD Normalized Count Data ")
dev.copy(png, "Sample clustering (hcluster) of RLD_Pre_Type+Batch.png")
dev.off()
VSD
vsd = getVarianceStabilizedData(dds.pre)
save(vsd, file="VSD_Pre_Type+Batch")
design=model.matrix(~sample.table.ERSvsPT.pre$Type)
vsd.removed=removeBatchEffect(vsd, batch=sample.table.ERSvsPT.pre$Batch, design= design )
colnames(vsd.removed) <- rld.removed$Written.ID
plot(hcluster(dist(t(vsd.removed)), method= "spearman", link="average" ), cex=1.5, cex.main=2, main= "Hierarchical Clustering of VSD Normalized Count Data ")
dev.copy(png, "Sample clustering (hcluster) of vsd_Pre_Type+Batch.png")
dev.off()
PCA plot after rld()
& removeBatchEffect()
:
Hierarchical Clustering Dendogram after rld()
& removeBatchEffect()
:
Hierarchical Clustering Dendogram after vsd()
& removeBatchEffect()
:
What type of data did you feed into
removeBatchEffect
? - Normalised counts, rlog transformed, vst transformed, voom'd?I tried both rlog &vsd. The 1st dendogram = the rlog transformed and batch removed, 2nd dendogram from the vsd transformed and batch removed.
Try Voom Transforming - That way you know that the
removeBatchEffect
function is getting what it expects.Hi Andrew, thanks for the suggestion; but I read that voom transformed data is not applicable for WGCNA analysis. So if we assume I did the vsd normalization, do you suggest me to remove any samples ?
Where did you read that? - Voom transforms RNA seq counts to "microarray-like" data that should be fine with WGCNA. If you want to batch correct, and you want to use Limma's
removeBatchEffect
(which I think is best in your current situation), then try and get the data to a point that Limma expects -Voom
is best in that regard. See point 4 hereThanks a lot; I will implement it into my analysis immediately.
Hi Andrew,
Sorry to disturb but may I ask a few quick questions ? While transforming the count data with voom()
1 - For design I am using both "~ Batch + Type", is it the correct approach ?
2 - Should I use TMM-normalization before using voom ()
3 - Should directly input the raw count data to voom ()
4 - Even though I deleted the rows with zero, my voom plots do not look like the plots in this paper. Am I doing something wrong ?
5 - Do you think it is better if I do differential gene expression analysis also using Limma's commands?
I really appreciate if you give me some insight,
Thanks in advance,