DESeq2: Continuous values
9.2 years ago
VHahaut ★ 1.2k

Hello!

I am currently analyzing a dataset containing the following DESeq2 design:

sample      group    continuous_value
sample_a    A        35
sample_b    A        10
sample_c    B        2
sample_d    B        5

design(Experiment) <- formula(~ continuous_value + group)

Each sample belongs to a group of 5 individuals: Group A contains the WT samples and Group B the knockdown samples.

Each sample has an associated continuous value (a percentage). This value is the percentage of cells in the sample that are the ones I am interested in. In other words, each sample contains cells of the same cell type, but only x% of them have the phenotype I want to analyse. Since the percentage of cells of interest varies from one sample to another, I would like to adjust the results accordingly.

The question is the following: how does DESeq2 handle these continuous values? Is this design the most appropriate?

I am afraid I do not fully understand the part of the DESeq2 vignette that covers this.

I already tested three approaches:

  • With the percentage in the design
  • Without the percentage in the design
  • With the percentage transformed into a small number of bins, as advised in the vignette. Unfortunately, I got the error: Error in DESeqDataSet(se, design = design, ignoreRank) : the model matrix is not full rank, so the model cannot be fit as specified. one or more variables or interaction terms in the design formula are linear combinations of the others and must be removed. Moreover, we currently have no biological information that would let us cluster these percentages into groups, so the cut-offs between the groups are arbitrary.

Thanks in advance for your answers!

DESeq2 R Design • 4.3k views

I strongly encourage you to find a local collaborator. It'll take some playing around with the data for someone to come up with an optimal solution.

When you have a significant nuisance covariate like this, you pretty much have to include it somehow in the design for the results to be useful, so the second test you tried can be ignored. Creating groups, as you tried in test 3, is often the simplest route, though as you've noticed you have to be fairly familiar with the underlying statistics to keep these groups from confounding the group effect that you actually care about. Sometimes a simple transformation of the continuous values (e.g., with log2) gives more reasonable results, but again you really need someone familiar with messier designs like this to work directly with the data. In an ideal world, they can then tell you what was tried and why/how the best design was arrived at (since you'll learn a LOT from that process).
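The confounding described here can be checked before any DESeq2 call. Below is a minimal base R sketch, not a definitive recipe. It assumes a toy sample table in which the first two percentages per group come from the question and the rest are made up for illustration; the column names `pct` and `pct_bin` and the 9% cut-off are likewise hypothetical.

```r
# Toy sample table: 5 samples per group, as described in the question.
# Only 35, 10 (group A) and 2, 5 (group B) come from the post; the other
# percentages are invented for illustration.
coldata <- data.frame(
  group = factor(rep(c("A", "B"), each = 5)),
  pct   = c(35, 10, 42, 28, 19, 2, 5, 8, 3, 6)
)

# Arbitrary cut-off at 9%: every A sample lands in "high", every B in "low".
coldata$pct_bin <- cut(coldata$pct, breaks = c(0, 9, 100),
                       labels = c("low", "high"))

# The cross-tabulation shows that the bins duplicate the group labels.
print(table(coldata$group, coldata$pct_bin))

# Build the design matrices that the two formulas would produce.
mm_binned  <- model.matrix(~ pct_bin + group, data = coldata)
mm_numeric <- model.matrix(~ pct + group, data = coldata)

# A design is full rank when its rank equals its number of columns.
rank_binned  <- qr(mm_binned)$rank
rank_numeric <- qr(mm_numeric)$rank

cat("binned design:  rank", rank_binned, "of", ncol(mm_binned), "columns\n")
cat("numeric design: rank", rank_numeric, "of", ncol(mm_numeric), "columns\n")
```

With these toy values the binned design has fewer independent columns than coefficients, which is the situation the "not full rank" error reports, while the numeric covariate keeps the design full rank. Whether bins are usable on real data depends on whether each bin contains samples from both groups.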

BTW, make some PCA plots and see how things group according to the covariate. Sometimes that's enough to figure out how to handle things.
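As a rough illustration of that check, here is a base R sketch on simulated counts. Everything in it is an assumption made for the example: the count matrix is simulated, 50 of the 200 genes are made to scale with the percentage, and log2(count + 1) stands in for a proper variance-stabilising transformation. With a real DESeq2 object the usual route would be the package's own `vst()` and `plotPCA()` functions instead.

```r
set.seed(1)

# Hypothetical sample annotations (only 35, 10, 2 and 5 come from the post).
pct   <- c(35, 10, 42, 28, 19, 2, 5, 8, 3, 6)
group <- factor(rep(c("A", "B"), each = 5))

# Simulated counts: 200 genes x 10 samples, with the first 50 genes
# scaling with the percentage of cells of interest.
n_genes <- 200
base    <- runif(n_genes, min = 50, max = 500)
effect  <- matrix(1, nrow = n_genes, ncol = length(pct))
effect[1:50, ] <- outer(rep(1, 50), 1 + pct / 20)
counts  <- matrix(rpois(n_genes * length(pct), lambda = base * effect),
                  nrow = n_genes)

# Crude log transform, then PCA with samples as rows.
log_counts <- log2(counts + 1)
pca <- prcomp(t(log_counts))

# How strongly does the first principal component track the covariate?
pc1_vs_pct <- cor(pca$x[, 1], pct)
print(data.frame(group = group, pct = pct, PC1 = round(pca$x[, 1], 2)))
cat("correlation of PC1 with pct:", round(pc1_vs_pct, 2), "\n")

# Scatter plot of the first two components, coloured by group and
# labelled with the percentage (only drawn in an interactive session).
if (interactive()) {
  plot(pca$x[, 1], pca$x[, 2], col = as.integer(group), pch = 19,
       xlab = "PC1", ylab = "PC2")
  text(pca$x[, 1], pca$x[, 2], labels = pct, pos = 3)
}
```

If the samples line up along a component in order of the percentage rather than by group, that suggests the covariate needs to be in the design.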

9.2 years ago

Provided you have a good number of observations, what you're asking is: "Are there any correlations between normalised counts and my continuous variable (based on your design formula), regardless of group?" If that's right, then first make sure that the continuous variable in your design matrix is numeric. You should also consider transforming the continuous variable, depending on the distribution of its values; a log transform, perhaps? You're most likely getting a rank error because your continuous variable is a factor in your design matrix. If that's not the case, then there are deeper issues with your underlying experimental design.
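The numeric-versus-factor point can be verified in base R. The sketch below is an illustration under assumptions, not the poster's actual data: the table is a toy one (only 35, 10, 2 and 5 come from the question), and the percentages are deliberately read in as text so that they end up as a factor.

```r
# Toy sample table where the percentage was (wrongly) imported as text
# and therefore became a factor.
coldata <- data.frame(
  group = factor(rep(c("A", "B"), each = 5)),
  continuous_value = c("35", "10", "42", "28", "19", "2", "5", "8", "3", "6"),
  stringsAsFactors = TRUE
)

was_numeric <- is.numeric(coldata$continuous_value)
cat("numeric before conversion:", was_numeric, "\n")

# As a factor, every distinct percentage gets its own coefficient, so the
# design has more columns than there are samples and cannot be full rank.
mm_factor <- model.matrix(~ continuous_value + group, data = coldata)
cat("factor design: ", ncol(mm_factor), "columns, rank",
    qr(mm_factor)$rank, "\n")

# Convert through character first: as.numeric() applied directly to a
# factor returns the internal level codes, not the original values.
coldata$continuous_value <- as.numeric(as.character(coldata$continuous_value))

# As a number, the covariate uses a single coefficient (a slope).
mm_numeric <- model.matrix(~ continuous_value + group, data = coldata)
cat("numeric design:", ncol(mm_numeric), "columns, rank",
    qr(mm_numeric)$rank, "\n")
```

Running `str()` on the sample table before building the DESeq2 object is a quick way to spot a covariate that was imported as character or factor.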


Sorry if my question was not well formulated, but I think you have misunderstood what I meant.

I don't want to know whether there are any correlations with the continuous variable regardless of the groups. I want to adjust the count matrix in such a way that this variable proportion of cells doesn't affect my comparison of the groups.

My main goal is to extract the differentially expressed genes between the groups, knowing that there is a starting bias in my samples because they are not pure populations. I thought that putting the percentage of cells in my design would do that.

Does that make more sense?

Anyway, thank you for your answer. I will go back to my design and verify that my values are encoded as numeric and not as factors.
