Dear all,
First of all, I should mention that I'm new to RNA-seq analysis and the DESeq2 package. I also have only (very) basic knowledge of statistics, so my apologies if I'm asking naive questions :)
We would like to analyse different cell populations that we isolated from different samples/environments (blood, ascites, tumor) from different patients. RNA sequencing was done in bulk. Because these data were generated in a collaboration between several research groups, not all the cells were isolated in the same lab. I would like to test this parameter, of course.
The idea behind my design is the following: because I expect differences between cell types (of course) and between conditions (the environment), I've created a new column in my annotation object that combines (paste0) the columns cell_type and cond. In brief, I will consider "gMDSC from blood" to be a different cell population from "gMDSC from ascites".
Here's an example of my annotation data frame:
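A minimal sketch of how such a combined column could be built, assuming the columns are named cell_type and cond as in the table shown in this post. The three rows are copied from that table; the underscore separator is inferred from the group labels rather than stated in the post.

```r
# Three rows taken from the annotation table in this post.
annot <- data.frame(
  cell_type = c("gMDSC", "gMDSC", "gMDSC"),
  cond      = c("Cancer_Blood", "Ascites", "Tumor"),
  origin    = factor(c(1, 1, 1)),
  row.names = c("CA.gMDSC.Blood", "DE.gMDSC.Ascites", "NO.gMDSC.Tumor")
)

# Combine cell type and condition into a single factor.
# paste(..., sep = "_") is equivalent to paste0(cell_type, "_", cond).
annot$group <- factor(paste(annot$cell_type, annot$cond, sep = "_"))

print(annot)
```

Storing the result as a factor matters because the design formula treats each distinct label as one level of the variable.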
                  cell_type  cond           origin  group
CA.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
DE.gMDSC.Ascites  gMDSC      Ascites        1       gMDSC_Ascites
DE.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
DO.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
FR.gMDSC.Ascites  gMDSC      Ascites        1       gMDSC_Ascites
FR.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
FR.gMDSC.Spleen   gMDSC      Cancer_Spleen  1       gMDSC_Cancer_Spleen
KD.gMDSC.Ascites  gMDSC      Ascites        1       gMDSC_Ascites
KD.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
NO.gMDSC.Ascites  gMDSC      Ascites        1       gMDSC_Ascites
NO.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
ON.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
ON.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
RE.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
RE.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
RI.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
RI.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
SH.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
TI.gMDSC.Ascites  gMDSC      Ascites        1       gMDSC_Ascites
TI.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
A01.gMDSC         gMDSC      Ascites        2       gMDSC_Ascites
A03.gMDSC         gMDSC      Ascites        2       gMDSC_Ascites
. . .
The sample names are used as row names. 1, 2, 3 and 4 are the four levels of my "origin" factor and correspond to the different research groups that isolated the cells.
The way I understand the DESeq2 design formula is: "you put the factor you want to compare in your analysis last, and the factors you want to 'control' for first". I guess "control" here means "taking into account the variability due to this factor while analysing DEGs for the factor of interest".
Here is my call:
dds <- DESeqDataSetFromMatrix(countData = cnt,
                              colData = annot,
                              design = ~ origin + group)
Unfortunately, I got this error message:
"Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed. Please read the vignette section 'Model matrix not full rank': vignette('DESeq2')"
If I remove "origin" from my design formula, the script runs fine. But I feel that I'm missing something quite important there.
So I'm quite lost here... Am I going in the right direction for this kind of analysis (comparing cell populations), or am I completely wrong?
Thanks in advance for your help, and sorry if I forgot to include some important information; do not hesitate to ask for it :)
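For readers hitting the same error, here is a rough sketch of how the rank of a design can be inspected with base R before handing it to DESeq2. The toy data frame is made up for illustration and is not the real annotation; it is chosen so that every origin contributes two groups, yet each group occurs in only one origin.

```r
# Hypothetical toy annotation: groups A and B come only from origin 1,
# groups C and D only from origin 2.
toy <- data.frame(
  origin = factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  group  = factor(c("A", "A", "B", "B", "C", "C", "D", "D"))
)

# Build the same kind of model matrix that the design formula implies.
mm <- model.matrix(~ origin + group, data = toy)

# Compare the numerical rank with the number of coefficients.
rank_mm      <- qr(mm)$rank
is_full_rank <- rank_mm == ncol(mm)

# Columns that are zero for every sample (e.g. factor levels without samples)
# also make the matrix rank deficient.
zero_cols <- colnames(mm)[colSums(mm != 0) == 0]

print(colnames(mm))
print(c(rank = rank_mm, n_coef = ncol(mm)))
print(is_full_rank)
print(zero_cols)
```

In this toy case the origin2 column equals the sum of the groupC and groupD columns, so the origin effect cannot be separated from the group effects. Whether the real data have the same structure can only be judged from the full annotation table.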
Chris
Your origin column appears to encode the same info as the group column, doesn't it?
Sorry, ignore that comment, I was confused by the alignment in your data frame.
Is there a level of "group" whose samples all come from a single origin, or a research centre that only provided samples of a single type?
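A quick way to answer this kind of question is to cross-tabulate the two factors. The sketch below uses a made-up toy data frame (not the real annotation) and assumes the columns are called origin and group.

```r
# Hypothetical toy annotation: group B is shared by both origins,
# groups A and C each come from a single origin.
toy <- data.frame(
  origin = factor(c(1, 1, 2, 2)),
  group  = factor(c("A", "B", "B", "C"))
)

# Samples per origin/group combination.
tab <- table(origin = toy$origin, group = toy$group)
print(tab)

# Number of distinct origins that contributed each group.
n_origins_per_group <- colSums(tab > 0)

# Groups whose samples all come from one origin.
single_origin_groups <- names(n_origins_per_group)[n_origins_per_group == 1]
print(single_origin_groups)

# Number of distinct groups provided by each origin.
n_groups_per_origin <- rowSums(tab > 0)
print(n_groups_per_origin)
```

On the real data, the same table would be obtained with table(annot$origin, annot$group).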
Hi, thank you for your time
No, for each level of "origin", there are at least two levels of "group" :)