Hello,
I have data from a Tn-Seq experiment. We have a barcoded knockout library. We measured the abundance of each knockout at time zero (by taking an aliquot from the common library), then did a 2x2 factorial experiment with 5 replicates per level.
I understand how to use (the excellent) DESeq2 to analyze the factorial structure of my data, by having a two-factor colData data.frame and using a formula with an explicit interaction (e.g. ~ A + B + A:B). This is very useful for telling me how the treatments differ in their effect on relative abundance.
However, it does not incorporate the T0 data. If I could incorporate that, it would also give me information on absolute fitness, which would be very useful for my study.
I have an alternate in-house pipeline that first calculates fitness by finding log2 changes from T0 for each replicate, then just does a two-factor lm on each gene's fitness with subsequent BH corrections, but I think this in-house pipeline doesn't do a great job normalizing gene counts and is giving spurious results.
Is there a way I can do this in DESeq2? I apologize for any lack of clarity. Thanks!
Hi Chaco001, could you please clarify the following points so that we can provide accurate answers ?
Thank you Carlo.
Yes, the data is Tn-Seq, and so it is abundance of transposon insertion, or equivalently the relative frequency of that mutant in the population. It is true that historically this is been treated in just a binary fashion (using a threshold to call things essential / non-essential), but recently groups have been using changes in the insertion frequency across treatments / times to quantitatively assess fitness of the knockout mutants (see e.g. Morin et al https://elifesciences.org/articles/37072). The general idea is that if a given insertion mutant has higher-than-expected representation, that KO has higher fitness than the WT. And therefore the reverse is also true. The typical idea is to do a log2-fold change between a T0 (before the experiment) and T1 (after the treatment) for each gene (averaged in some way across insertions), and use this change as a quantitative estimate of fitness.
In our case, the experiment is growing the Tn-Seq library in the presence / absence of two competitor species. So there are four treatments in a factorial design: the library grown alone, the library in the presence of competitor A, the library in the presence of competitor B, and the library in the presence of competitors A and B. There is also the common T0.
In my mind, applying a tool like DESeq2 to just the final data (without the T0 data) tells us something about how relative insertion # (proportional to relative fitness) changes in a factorial way due to the presence of these two species. So perhaps a KO has neutral fitness when competitor A is present, reduced fitness when competitor B is present, and perhaps there is an interesting interaction where fitness is again neutral when both competitors are present. This would show up in DESeq2 as no significant main effect of species A, (probably) a significant effect of species B, and definitely a significant interaction.
Hopefully this helps, thank you for the clarifying questions.
Thanks for your detailed clarification. So factor A is the mutation (knock-outs), factor B is the presence/absence of competitor, the time (T0/T1) is currently not taken into account in the model.
To get back to your original question:
Absolutely essential genes will be devoid of reads, regardless of the condition so even if you add T0 data in a linear model, you will just be able to identify genes that affect fitness differently in T0 vs (T1NoCompetition), so the time effect. I'm not saying that T0 can not be used to identify absolutely essential genes, just that differential analysis (e.g., DESeq2) is probably not the right tool for that: differential is always relative to a control, it can never be absolute.