Hi all,
I am performing an RNA-sequencing experiment using primary human fibroblasts. The study is a simple case/control study with a treated and untreated condition.
I have begun my analyses by comparing case/control while disregarding treatment. When I don't include covariates I get 90 significant genes out of a count matrix of ~26k. These seem to make biological sense.
However, when I include age as a covariate, I suddenly get almost 9k significant genes. I have also generated histograms of the age distribution in the cases vs controls and they seem to be relatively similar.
It is also worth noting that HOX genes are a confounder in my experiment. However, I am unsure why they cropped up as differentially expressed when accounting for age but when I didn't include age in the model.
Would you have any ideas as to why this has happened?
Code below written in rstudio.
Analysis Without Covariate
dds <- DESeqDataSetFromMatrix(counts, meta, design = ~condition)
dds <- DESeq(dds)
contrast_condition <- c("condition", "case", "control")
#Extract results
res_table <- results(dds, contrast = contrast_condition, alpha = 0.05)
#Apply LFC shrinkage
resultsNames(dds)
res_table <- lfcShrink(dds, coef = "condition_case_vs_control", type = "apeglm")
res_table <- res_table %>%
data.frame() %>%
rownames_to_column(var = "gene") %>%
tibble()
#Extract significant results
sig_res <- res_table %>%
dplyr::filter(padj < 0.05)
Top Sig Genes No Covariate
Analysis with Age as A Covariate
#Scale and centre the age variable as numeric variables with high mean can impair the model fit
meta$age_months_scaled <- scale(meta$age_months, center = TRUE)
#Remove NAs from age$months as DESeq can't handle NAs
meta_na.rm <- meta[!is.na(meta$age_months), ]
counts_na.rm <- counts[ , meta_na.rm$sample_name]
#Run the analysis on DESeq dataset
dds_age <- DESeqDataSetFromMatrix(counts_na.rm, meta_na.rm, design = ~age_months_scaled + condition)
dds_age <- DESeq(dds_age)
#Define contrast
contrast_condition <- c("condition", "case", "control")
#Extract results
res_table_age <- results(dds_age, contrast = contrast_condition, alpha = 0.05)
#Apply LFC shrinkage
resultsNames(dds_age)
res_table_age <- lfcShrink(dds_age, coef = "condition_case_vs_control", type = "apeglm")
#No LFC shrinkage errors
res_table_age <- res_table_age %>%
data.frame() %>%
rownames_to_column(var = "gene") %>%
tibble()
#Extract significant results
sig_res_age <- res_table_age %>%
dplyr::filter(padj < 0.05)
Top Sig Genes With Covariate
It would probably be worth doing a bit of data exploration with e.g. PCA to better understand the factors contributing to the variance in your data. This becomes more important as you start adding covariates to a model.