Hello!
I am currently analyzing a dataset containing the following DESeq2 design:
sample     group   continuous_value
sample_a   A       35
sample_b   A       10
sample_c   B       2
sample_d   B       5
design(Experiment) <- formula(~ continuous_value + group)
Each sample belongs to a group of 5 individuals: group A contains the WT samples and group B the knockdown samples.
Each sample has an associated continuous value, expressed as a percentage. It is the percentage of cells in that sample that I am interested in. In other words, every sample contains cells of the same cell type, but only x% of them show the phenotype I want to analyse. Since the percentage of cells of interest varies from one sample to another, I would like to account for it when normalizing the results.
My question is: how does DESeq2 handle these continuous values? Is this design the most appropriate one?
I am afraid I don't fully understand the part of the DESeq2 vignette that discusses this.
I already tested three approaches:
- Including the percentage in the design
- Leaving the percentage out of the design
- Binning the percentage into a small number of groups, as advised in the vignette. Unfortunately, I got this error:
Error in DESeqDataSet(se, design = design, ignoreRank) : the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed.
Moreover, we currently don't have any biological information that would let us cluster these percentages into groups, so the cut-offs between the groups are arbitrary.
Thanks in advance for your answers!
I strongly encourage you to find a local collaborator. It'll take some playing around with the data for someone to come up with an optimal solution.
When you have a significant nuisance covariate like this, you pretty much have to include it somehow in the design for the results to be useful, so the second test you tried can be ignored. Creating groups like you tried in test 3 is often the simplest route, though as you've noticed you have to be fairly familiar with how the underlying statistics work so that these bins don't end up confounded with the group effect that you actually care about. Sometimes it turns out that a simple transformation of the continuous values (e.g., with log2) provides more reasonable results, but again you really need someone familiar with messier designs like this to work directly with the data. In an ideal world, they can then tell you what was tried and why/how the best design was arrived at (since you'll learn a LOT from that process).
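To make the confounding point concrete, here is a rough sketch in base R (no DESeq2 needed) of one common way the "not full rank" error shows up: if the cut-off puts every WT sample in one bin and every knockdown sample in the other, the bin column and the group column carry the same information. Only the first two values per group come from the table in the question; the other percentages, the 9.5% cut-off and the helper name is_full_rank are made up for illustration, and I can't tell from here whether this is exactly what happened with your data.

```r
# Toy sample sheet: 5 WT (A) and 5 knockdown (B) samples.
# Percentages beyond the four shown in the question are invented.
coldata <- data.frame(
  group = factor(rep(c("A", "B"), each = 5)),
  pct   = c(35, 10, 28, 40, 22, 2, 5, 8, 3, 6)
)

# Hypothetical helper: TRUE when the design matrix has full column rank,
# i.e. no column is a linear combination of the others.
is_full_rank <- function(design, data) {
  mm <- model.matrix(design, data)
  qr(mm)$rank == ncol(mm)
}

# Arbitrary cut-off at 9.5%: all A samples fall in "high", all B in "low",
# so the bin is perfectly confounded with the group.
coldata$pct_bin <- cut(coldata$pct, breaks = c(0, 9.5, 100),
                       labels = c("low", "high"))
print(table(coldata$group, coldata$pct_bin))
print(is_full_rank(~ pct_bin + group, coldata))   # FALSE

# Keeping the covariate continuous (here on a log2 scale) avoids that
# particular problem, because it still varies within each group.
coldata$log2_pct <- log2(coldata$pct)
print(is_full_rank(~ log2_pct + group, coldata))  # TRUE
```

Running the same check on the real sample sheet before calling DESeq2 shows quickly whether a given binning is usable. Note that a full-rank design only means the model can be fit. When the percentages barely overlap between WT and knockdown, the covariate and the group effect remain hard to separate.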
BTW, make some PCA plots and see how things group according to the covariate. Sometimes that's enough to figure out how to handle things.
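A minimal sketch of that kind of PCA check, using base R on simulated data so it runs anywhere. Everything in it is a toy assumption: the expression matrix is random noise with a percentage-driven signal added to 50 genes, and the percentages are the same made-up ones as above. With a real DESeqDataSet you would normally start from variance-stabilized counts (DESeq2 provides vst() and plotPCA() for this) rather than from a simulated matrix.

```r
set.seed(1)  # toy data only; replace with the real transformed counts

# Same sample layout as in the question; most percentages are invented.
group <- factor(rep(c("A", "B"), each = 5))
pct   <- c(35, 10, 28, 40, 22, 2, 5, 8, 3, 6)

# Simulated log-scale expression matrix (200 genes x 10 samples) in which
# the first 50 genes track the percentage of cells of interest.
n_genes <- 200
expr <- matrix(rnorm(n_genes * length(pct)), nrow = n_genes)
expr[1:50, ] <- expr[1:50, ] + outer(rep(1, 50), log2(pct))

# PCA with samples as observations (hence the transpose).
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
var_explained <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

# Colour by group, label each point with its percentage.
plot(pca$x[, 1], pca$x[, 2],
     col = c("black", "red")[as.integer(group)], pch = 19,
     xlab = paste0("PC1 (", var_explained[1], "%)"),
     ylab = paste0("PC2 (", var_explained[2], "%)"))
text(pca$x[, 1], pca$x[, 2], labels = paste0(pct, "%"), pos = 3)
legend("topright", legend = levels(group), col = c("black", "red"), pch = 19)

# How strongly do the first two components follow the covariate?
pc_cor <- cor(pca$x[, 1:2], log2(pct))
print(pc_cor)
```

If the samples line up along a component in order of their percentage, that is a hint the covariate matters and roughly on which scale (raw or log2) it behaves most linearly.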