Hello you all,
I have RNA sequencing data from two different runs. In the first batch I have samples from 3 groups (A,B,C) and in the second one I have samples from 3 groups (A,C,D). My PCA data shows samples clustering by group but then separated by batch (and the batch effect is much stronger than group differences).
To give you an idea, this is the distribution of the samples:
group<-factor(c(rep("A",each=12),rep("B",7),rep("C",9),"A",rep("C",6),rep("D",16)))
batch<-factor(c(rep(1,28),rep(2,23)))
design <- model.matrix(~0 + group + batch)
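A quick way to see how the groups are spread across the two batches, and whether group and batch can be separated in this design, is to tabulate them and check the rank of the design matrix. This is only a sketch in base R using the vectors above; the names `tab` and `is_full_rank` are introduced here for illustration.

```r
# Same sample layout as in the post
group <- factor(c(rep("A", 12), rep("B", 7), rep("C", 9), "A", rep("C", 6), rep("D", 16)))
batch <- factor(c(rep(1, 28), rep(2, 23)))

# Number of samples per group and batch
tab <- table(group, batch)
print(tab)

# Design with one coefficient per group plus a batch 2 offset
design <- model.matrix(~0 + group + batch)

# Full column rank means the group and batch coefficients can be estimated separately
is_full_rank <- qr(design)$rank == ncol(design)
print(is_full_rank)
```

Only groups A and C appear in both batches (A with a single sample in batch 2), so any estimate of the batch offset rests on those samples; B and D each occur in one batch only.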
I have two main questions, first, can I find a vector/value/something using samples from group A and C in both batches and then use this to compensate for the batch effect of all samples in batch 2 (A, C and D)?
And second, I have come up with a signature of different genes to discriminate between A, B and C in batch one, so I need to compensate for batch effect only in Batch 2 to validate the initial signature.
Is all of this possible? I have been playing with the limma package but I am not succeeding.
The PCA (MDS):
Thank you very much in advance,
Jenn
Hola Jenn, how do you define 'not succeeding' [with limma]? Can you show the PCA bi-plot with percent variation explained for each PC? Why are you using limma if this is RNA-seq? I assume limma/voom?
Probably, yes, but this should be modeled by including batch in the design formula. However, I am aware that, for example, group D is only in batch 2. Is group D critical to your analysis?
Hi Kevin, by not succeeding I mean that I have not been able to get anything out of this. My code did not work, and since my end goal is not differential expression analysis but a "corrected" read count table, I couldn't find any way of getting this. Also, when I repeated the PCA analysis (edited into the original post) I did not see any changes. I have been asking around, but batch-effect correction does not seem to be very popular.
I am using limma/voom because, from what I have read online, they are better than ComBat. And since I am not doing any differential expression analysis, why does it matter whether this is RNA, DNA or protein data when selecting the tool? To answer your last question, yes, D is quite important, as we want to validate the initial signature in an independent group.
Thanks!!
It may help to explain what these groups are, as understanding the biology can be important. I understand that you may not want to share, though. For example, I still cannot gauge, from where I am sitting, the full significance of sample group D in the context of your work. From what I can see, group D should be left out of the initial analysis but included in [perhaps] a meta-analysis of the initial results. Otherwise, you can play around with removeBatchEffect from limma. I 'never' directly modify my data for batch, though, so I don't know if there is enough overlap between batch 1 and batch 2 for group C such that any correction can be done on D.
So, using LASSO, I have constructed a signature of genes that can discriminate between patients with different diseases (A,B) and healthy controls (C). This signature is "tailored" to the first (training) cohort, i.e. batch 1. The goal was to validate this signature in a second cohort (C) and see if it was possible to extrapolate these results to other types of diseases (D). While the controls from the first and second batch cluster together, they are still quite separated by batch. Since samples from A and C in both batches seem to behave similarly (they cluster together), I thought it would be possible to find a way to use these samples to compensate for the batch effect for the whole second cohort. Hope it helps and makes sense. And thanks for your time.
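The idea discussed above, estimating a per-gene batch offset from the groups present in both batches (A and C) and subtracting it from every batch 2 sample (including D), can be sketched in base R as follows. This is a minimal illustration on simulated log-expression values, not a recommended pipeline. It assumes the batch effect is additive on the log scale and identical for D as for A and C, which cannot be verified from batch 2 alone. All object names (`logexpr`, `true_shift`, `est_shift`, `corrected`) and the simulated numbers are hypothetical. The result is a corrected log-expression matrix, not a corrected count table.

```r
set.seed(42)

# Sample layout from the original post
group <- factor(c(rep("A", 12), rep("B", 7), rep("C", 9), "A", rep("C", 6), rep("D", 16)))
batch <- factor(c(rep(1, 28), rep(2, 23)))

# Simulated data (hypothetical): group means + batch 2 offset + noise, genes in rows
n_genes <- 100
true_shift <- rnorm(n_genes, sd = 2)
group_effect <- matrix(rnorm(n_genes * nlevels(group)), nrow = n_genes)
logexpr <- group_effect[, as.integer(group)] +
  outer(true_shift, as.numeric(batch == "2")) +
  matrix(rnorm(n_genes * length(group), sd = 0.1), nrow = n_genes)

# Estimate the batch 2 offset per gene using only groups seen in both batches
shared <- group %in% c("A", "C")
g <- droplevels(group[shared])
b <- batch[shared]
fit <- lm(t(logexpr[, shared]) ~ g + b)  # one linear model per gene
est_shift <- coef(fit)["b2", ]

# Subtract the estimated offset from all batch 2 samples (A, C and D)
corrected <- logexpr
corrected[, batch == "2"] <- logexpr[, batch == "2"] - est_shift

# For comparison, if limma is installed: removeBatchEffect uses all samples
# and keeps the group effects named in 'design'
if (requireNamespace("limma", quietly = TRUE)) {
  limma_corrected <- limma::removeBatchEffect(logexpr, batch = batch,
                                              design = model.matrix(~group))
}
```

Because the offset is learned from A and C only, batch 1 is left untouched, which matches the goal of keeping the training cohort fixed and moving only batch 2 towards it. With a single A sample in batch 2, the estimate leans almost entirely on the C samples.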