Hi,
First of all, I apologize for a long post .
I am analyzing miRNA sequencing data generated from human samples. I have 10 paired samples - control and treated. These samples were sequenced in two batches - 4 pairs in one batch and 6 pairs in another batch. The objective is to identify differentially expressed miRNAs between control and treated groups. The experimental design is as follows:
Sample_ID Condition Patient Batch
IonXpress_RNA_001 Healthy A 1
IonXpress_RNA_002 Treated A 1
IonXpress_RNA_003 Healthy B 1
IonXpress_RNA_004 Treated B 1
IonXpress_RNA_005 Healthy C 1
IonXpress_RNA_006 Treated C 1
IonXpress_RNA_007 Healthy D 1
IonXpress_RNA_008 Treated D 1
IonXpress_RNA_011 Healthy E 2
IonXpress_RNA_012 Treated E 2
IonXpress_RNA_017 Healthy F 2
IonXpress_RNA_018 Treated F 2
IonXpress_RNA_021 Healthy G 2
IonXpress_RNA_022 Treated G 2
IonXpress_RNA_023 Healthy H 2
IonXpress_RNA_024 Treated H 2
IonXpress_RNA_025 Healthy I 2
IonXpress_RNA_026 Treated I 2
IonXpress_RNA_027 Healthy J 2
IonXpress_RNA_028 Treated J 2
I tried to analyze the data using DESeq2. I checked to see if the samples cluster based on batch and yes, the PCA plot clearly showed two distinct clusters - batch 1 and batch 2. I used variance stabilized counts for this purpose. Then, I removed batch effects using assay(vsd)<-limma::removeBatchEffect(assay(vsd), vsd$Batch)
. PCA plot now showed as a single cluster and not as two distinct clusters.
To identify differentially expressed miRNAs, I used <-DESeqDataSetFromMatrix(countData=cts, colData=coldata, design=~Patient + Condition)
. Since this is a paired sample analysis, I used the variable 'patient' in my design. I did get a list of differentially exprssed miRNAs. However, when I tried to include "batch" in the design, I am getting the following error message -
invalid class “DESeqDataSet” object: the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others
I am trying to figure out the reason for this error - is this because the conditions (healthy and treated) are overlapping between both the batches?
Is there a way that I can rectify this ?
Should I change my design ?
If I do not correct for batch effects or remove the batch effects, will it drastically change my output? I understand that batch effects can lead to spurious results. However, I am asking this because ~ 75% of the differentially expressed miRNAs obtained from my analysis have got validated in RT-PCR.
I understand that the input data has to be raw counts for DESeq2 but is there a way that I can do a differential expression analysis for batch corrected dataset and compare the results with the uncorrected dataset? This is for learning purpose.
Thank you very much
Best regards,
Preethi
Thank you very much for your reply. I did do a PCA and the difference looked glaring. Both the batches separated as two clusters. I am not sure how to share the image here. But yes, they separated as two clusters and when I tried PCA plot using batch effects removed variance stabilized counts, the batches merged as one cluster.