Hi everyone,
My lab has differentiated iPSC lines, and I need to do a bioinformatic analysis to understand how close they are to the real organ made of these cells. To do this I've gathered publicly available RNA-seq data from this tissue, both fetal and adult. I'm now doing some exploratory analysis with PCA and dendrograms, but I'm struggling with some issues.
- Constructing the DESeqDataSet
My coldata table (shown here with a reduced number of samples) looks like this:
ID        Study   Stage
Sample1   Lab     IPSCs
Sample2   Lab     IPSCs
Sample3   Lab     IPSCs
Sample4   Study1  Fetal
Sample5   Study1  Fetal
Sample6   Study2  Fetal
Sample7   Study2  Fetal
Sample8   Study2  Fetal
Sample9   Study2  Fetal
Sample10  Study2  Fetal
Sample11  Study2  Fetal
Sample12  Study3  Adult
Sample13  Study3  Adult
Sample14  Study3  Adult
Sample15  Study4  Adult
Sample16  Study4  Adult
Sample17  Study4  Adult
Sample18  Study4  Adult
Sample19  Study4  Adult
Sample20  Study4  Adult
I thought the best approach would be to test for Stage (iPSCs/fetal/adult) while controlling for the effect of Study (the fact that the samples come from different experiments):
dds <- DESeqDataSetFromMatrix(countdata, coldata, design= ~ Study + Stage)
However, no matter how I set this up (I have tried and retried in different ways), I always get this error:
Error in checkFullRank(modelMatrix):
the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed.
I don't see any linear combination. I thought the problem was our own samples (since they are the only iPSCs); however, I've tried running this without them and the error persists. So I think I'm really missing something. I've read potentially every single post from people facing the same problem, and also read the tool's vignette, but I still can't figure out what's wrong with my design or how I can solve the issue.
- Accounting for batch effects for visualization purposes
I'm guessing limma::removeBatchEffect for Study would do the trick, but I would appreciate any hint on this topic too. Should I try sva to see if there are any other batch effects I should take into account?
I'm really sorry if these are really basic questions. I'm very new to bioinformatics and have been trying to find my way on my own, but sometimes I get stuck, especially with anything involving statistics, because I lack a foundation in it.
I really appreciate any input you can give me. Thank you so much in advance!
Study is confounded with Stage: every study contributes samples from a single stage only (Lab has no replicates in Adult or Fetal, and none of the public studies covers both Fetal and Adult). This is a typical fully confounded experiment. You might be able to eliminate batch effects within the Fetal samples and within the Adult samples, but that's it. To separate the two you would need samples from each Study in each Stage, which you don't have. Because of the batch effects, your samples are (probably) going to cluster in ways that have nothing to do with the true biology. This is a common limitation: you cannot simply collect arbitrary studies and expect to compare them. Unfortunately, that is not how it works.
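To make the confounding concrete, here is a minimal base-R sketch that rebuilds the sample table from the question and compares the rank of the model matrix with its number of columns. It does not call DESeq2 itself; it only assumes that a rank lower than the column count is what triggers the "not full rank" error. The variable names (coldata, mm_all, mm_sub, no_lab) are just illustrative.

```r
# Sample table as posted in the question (20 samples).
coldata <- data.frame(
  ID    = paste0("Sample", 1:20),
  Study = factor(rep(c("Lab", "Study1", "Study2", "Study3", "Study4"),
                     times = c(3, 2, 6, 3, 6))),
  Stage = factor(rep(c("IPSCs", "Fetal", "Adult"), times = c(3, 8, 9)))
)

# Cross-tabulation: each Study row has counts in exactly one Stage column,
# so knowing the Study already tells you the Stage.
print(table(coldata$Study, coldata$Stage))

# Model matrix implied by ~ Study + Stage, all samples.
mm_all <- model.matrix(~ Study + Stage, data = coldata)
print(c(columns = ncol(mm_all), rank = qr(mm_all)$rank))  # 7 columns, rank 5

# Same check after dropping the Lab (iPSC) samples.
no_lab <- droplevels(coldata[coldata$Study != "Lab", ])
mm_sub <- model.matrix(~ Study + Stage, data = no_lab)
print(c(columns = ncol(mm_sub), rank = qr(mm_sub)$rank))  # 5 columns, rank 4
```

The rank stays below the column count even without the Lab samples, because Study1 and Study2 are Fetal only and Study3 and Study4 are Adult only. A design of ~ Stage alone would be full rank, but then any Study effect is folded into the Stage estimate and cannot be separated from it.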
@ATpoint thanks for your input and time. However, as I mentioned, I also tried to perform this with the adult and fetal tissues only, and still got the same error which leads me to think there is something else wrong apart from my own experiment being confounded. Do you have any advice regarding what my approach should be as unfortunately that's the data I have to work with?
No, I do not think there is much you can do with these data. You could manually pick some marker genes that you know are good candidates for characterizing each stage, check whether the Fetal and Adult samples do in fact express them highly, then check their expression level in your own data, and finally confirm by qPCR on your RNA.
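As a rough sketch of that marker-gene check, something like the following base-R snippet could summarize expression per stage. Everything in it is a placeholder: the matrix expr stands in for your normalized (for example log-transformed) expression values, and the gene names, sample names, and marker list are hypothetical.

```r
# Toy expression matrix: genes in rows, samples in columns.
# Values and names are placeholders, not real data.
set.seed(1)
expr <- matrix(rexp(4 * 6), nrow = 4,
               dimnames = list(paste0("Gene", 1:4), paste0("S", 1:6)))

# Stage label for each column of expr (hypothetical assignment).
stage <- factor(c("IPSCs", "IPSCs", "Fetal", "Fetal", "Adult", "Adult"))

# Hypothetical list of known marker genes to inspect.
markers <- c("Gene1", "Gene3")

# Mean expression of each marker within each stage
# (rows = markers, columns = stages).
stage_means <- sapply(levels(stage), function(s) {
  rowMeans(expr[markers, stage == s, drop = FALSE])
})
print(stage_means)
```

Because Study and Stage are confounded, differences in such a table can still reflect batch rather than biology, which is why the qPCR confirmation matters.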