Hello everyone,
I am working on RNA-seq datasets and the question I'm interested in is "Keeping the expression levels of gene X constant, which genes are differentially expressed between conditions A and B?".
I think one way to do this is by using the design ~ Expression + Condition
(where Expression
is the expression levels of the gene X as a continuous variable) followed by extraction of the Condition effect by lfcShrink(dds, coef = "Condition_B_vs_A", type = "apeglm")
. My questions:
- Is this a valid approach for my purpose? if not, why? and what would be a valid way to achieve this? If yes:
- How should the expression levels of gene X be transformed prior to including it in the design? Should I use VST counts? Are centering and scaling required?
Thanks in advance for any help and guidance.
Thank you very much for your responses jared.andrews07 and Thomas Neff
In my example, gene X is representative of a marker for a specific cell type (e.g., NK cells). By accounting for the expression of this gene, I aim to account for the differences in the abundance of this specific cell type so that the differential abundance of this cell type would not have an effect on the downstream observations (I'm using this in two different cases. In one case, the differential abundance of the cell type is due to batch effect, and in another, it's a biologic effect that, regardless, we don't want to affect the observations). I assume I can input this gene (or multiple validated markers of the specific cell type) as the
controlGenes
(right?), but would you please elaborate as to why including this gene (or maybe the estimated abundance of the cell type from a deconvolution approach) in the design does not work? I believe incorporation of continuous variables such as age to account for their effects is regularly used in the field, so what's different in this case?Also
If this is also not a good idea, then do you know of any reasonable way to handle such situations?
"Control" genes are still variable across samples even if relatively consistent. A 10-20% change of a very highly expressed gene may be relatively meaningless - but it still happens and anyone who's done much qRT-PCR will be well aware of this. Hence why using a panel of housekeeping genes for qRT-PCR is usually a good idea. The same is largely true in RNA-seq, and normalizing to a single gene (or a handful of them) that should be pretty identical can still have dramatic effects. The RUVseq paper does a good job of demonstrating this with spike-ins inserted into the samples at supposedly equivalent levels.
Sort and mix your samples in equal proportions of your problematic cell type vs "others" or probably more simply on the front-end, do single cell RNA-seq. Though the analysis for that is considerably more annoying and time-consuming. Deconvolution of your bulk samples via CIBERSORTx or such would indicate how great the difference in abundance actually is, which may provide some additional avenues.