Sorry for being the millionth person to ask questions about this topic, but I haven't been able to find a clear answer to my questions.
So I'm using DESeq2 to find DE genes between two tissues. I have 20 samples of tissue 1, and 20 of tissue 2, and I use a design of ~ Tissue, and that gives me results.
But these aren't just 20 random tissue 1 and 20 random tissue 2 samples. There are relationships and structure between the samples.
My samples were taken at 4 different time points, and I want time point differences to be accounted for when testing whether genes are differentially expressed. So I use a design of Time + Tissue, and that gives me much the same genes, but with much better p-values. Is that how I should set it up, if my goal is just to look at tissue differences, correcting for time point effects? (I feel like Time:Tissue is not what I want, because I'm not interested in zooming into the differences at any one particular time point.)
My 40 samples are also paired: there is a tissue 1 and a tissue 2 sample from each of 20 individuals. So I could use a design of Individual + Tissue to get DE genes between tissues, and this would account for differences between individuals? (Separate question: is this a good idea with 20 individuals for 40 samples?)
So I'd really like DE genes between tissues, taking into account variation introduced by time point and individual, but Time + Individual + Tissue won't work, I presume because each individual is present at only a single time point. I can make a new column, with time and sample concatenated together, and do concat + Tissue, but now I've lost my replicates, and I'm not sure that splitting this up into 20 distinct groups is what I really want.
I can cheat a bit and make new individual numbers, that are just 1-5, so now New + Time + Tissue is full rank, because it thinks the same individuals are present in all 4 time point groups. Is this the right answer...or just a pretty good one?
What I want is for the software to understand that each sample is part of three separate groups, and for it to remove the influence from the timepoint and individual group to make the influence of the tissue group sharper. Is there an approach I am missing?
You don't have three separate groups, only 2. If you account for Individual then you've already accounted for Time. Honestly, just do that and be done with it.
Telling the software that people 1-5 are present at all time points (so you can have ~Individual + Time + Tissue) isn't going to gain you anything. Assuming these are human samples you're likely to have fairly high variance between individuals, so the Individual variance is going to be higher in this setup and I would worry that that'd make the variance for the Tissue coefficient similarly inflated...thus tanking your power. I mean, it's not like this stuff takes terribly long to run, so you can check, but that'd be my expectation.
Yeah, I guess you are right. Using Individual + Tissue was giving me the highest p-values, so I guess that's a sign that it was filtering away the most extraneous influences. Thanks.
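The point that Individual already absorbs Time can be checked by looking at the rank of the design matrix. Here is a minimal sketch in plain Python (not DESeq2 itself). It assumes 5 individuals per time point, as the 1-5 relabelling in the question suggests, and treatment-coded dummy columns like R's default. The helper names dummies, design and rank are made up for this illustration.

```python
from fractions import Fraction

def dummies(values):
    """Treatment coding: one 0/1 column per level, first level dropped."""
    levels = sorted(set(values))[1:]
    return [[1 if v == lev else 0 for lev in levels] for v in values]

def design(*factors):
    """Intercept column followed by the dummy columns of each factor."""
    rows = [[1] for _ in factors[0]]
    for factor in factors:
        for row, cols in zip(rows, dummies(factor)):
            row.extend(cols)
    return rows

def rank(matrix):
    """Exact matrix rank by Gaussian elimination on fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            if m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

# Assumed layout: 20 individuals, 2 tissues each, 5 individuals per time point.
individual, time, tissue = [], [], []
for ind in range(20):
    for tis in (1, 2):
        individual.append(ind)
        time.append(ind // 5)  # each individual is seen at exactly one time point
        tissue.append(tis)
relabelled = [ind % 5 for ind in individual]  # the "1-5" trick from the question

paired = design(individual, tissue)        # ~ Individual + Tissue
nested = design(individual, time, tissue)  # ~ Individual + Time + Tissue
relab = design(relabelled, time, tissue)   # ~ New + Time + Tissue

for name, mat in [("paired", paired), ("nested", nested), ("relab", relab)]:
    print(name, "columns:", len(mat[0]), "rank:", rank(mat))
# paired columns: 21 rank: 21
# nested columns: 24 rank: 21
# relab columns: 9 rank: 9
```

Under these assumptions the three Time columns add no rank on top of Individual, which matches the observation that Time + Individual + Tissue won't work. The relabelled design is full rank only because it treats unrelated individuals as if they were the same person.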
Hi, Devon,
I have a similar but more complicated problem. I have 94 samples from 4 tissues, treated with two different chemicals and sampled at 4 different time points (4 x 2 x 4 x 3, minus 2 missing samples).
The problem is that these samples come from 10 different individuals and have two different genotypes. What I want to do is to account for and regress out the influence of genotype and individual.
I actually tried four models for each of the 4 tissues separately (because tissue contributes the largest variation and one of the tissues has much larger variance, making two samples within this tissue look like outliers even though they are normal when clustered with samples from the same tissue, and the dispersion estimates were not good when all 4 tissues were modeled together). The number of DEGs identified increases from one model to the next in this order:
~ treatment + time + treatment:time
~ genotype + treatment + time + treatment:time
~ individual + treatment + time + treatment:time
The DEGs found with “~ genotype + individual + treatment + time + treatment:time” were the same as with the last model above.
I want to ask whether I need to include “individual”, or just “genotype”.
Many thanks! Apologies if this is confusing.
Your primary issue is that you're very low on degrees of freedom. If you added in individual you would likely have none left. Ideally individuals act as reasonable replicates for the effect you're trying to model; genotype may or may not as well. In general it sounds like you need vastly more samples. You may also want to look at the edgeR vignette (or maybe it's limma), since they have a nice example of compensating for individual without destroying the degrees of freedom.
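To make the degrees-of-freedom concern concrete, here is a rough counting sketch in plain Python. It is only an illustration under assumptions that are not stated in the thread: one tissue analysed on its own with 24 samples (2 treatments x 4 time points x 3, ignoring the 2 missing samples), all 10 individuals represented in that tissue, and every added column actually estimable. The function name n_coefficients is made up for this example.

```python
def n_coefficients(levels, interactions=()):
    """Columns in a treatment-coded design: intercept, (levels - 1) per
    factor, and the product of (levels - 1) for each two-way interaction."""
    count = 1 + sum(k - 1 for k in levels.values())
    for a, b in interactions:
        count += (levels[a] - 1) * (levels[b] - 1)
    return count

n_samples = 24  # ASSUMPTION: one tissue, no missing samples

models = {
    "treatment + time + treatment:time":
        {"treatment": 2, "time": 4},
    "genotype + treatment + time + treatment:time":
        {"genotype": 2, "treatment": 2, "time": 4},
    "individual + treatment + time + treatment:time":
        {"individual": 10, "treatment": 2, "time": 4},
}

residual_df = {}
for name, levels in models.items():
    p = n_coefficients(levels, interactions=[("treatment", "time")])
    residual_df[name] = n_samples - p  # samples minus fitted coefficients
    print(f"{name}: {p} coefficients, {residual_df[name]} residual df")
# treatment + time + treatment:time: 8 coefficients, 16 residual df
# genotype + treatment + time + treatment:time: 9 coefficients, 15 residual df
# individual + treatment + time + treatment:time: 17 coefficients, 7 residual df
```

The exact numbers depend on how individuals are really spread across tissues, treatments and time points, so treat them as an upper bound on what is left. Since each individual has a single genotype, genotype is nested within individual and adds no further columns of rank once individual is in the model, which would be consistent with the two models giving the same DEGs.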
Yes, I have too many variables. But that's the data I have and need to analyze. Is there any relationship between the degrees of freedom and the number of DEGs identified? It seems that I get more DEGs when I include individual in the model.
I will check the vignette of edgeR and limma.
Thanks again