Question

DESEQ with time series data on mutant vs. wild-type?

1

Entering edit mode

3 months ago

Megan ▴ 50

Hi,

I want to know if I am on the correct track with this idea. I am looking to see if there are differences in circadian rhythm genes with a control group & a knockout group at two different time points-Morning & Night.

My colData is

 "colData <- tibble(
  id = c('HW2_1', 'HW2_2', 'HK2_1', 'HK2_2', 'HK2_3', 'HW14_1', 'HW14_2', 'HK14_1', 'HK14_2', 'HK14_3'),
  condition = c('WT', 'WT', 'KO', 'KO', 'WT', 'WT', 'WT', 'KO', 'KO', 'KO'),
  time = c('Day', 'Day', 'Day', 'Day', 'Day', 'Night', 'Night', 'Night', 'Night', 'Night')
)" and my design for this is 

"dds <- DESeqDataSetFromMatrix(countData = Count_Data, 
                              colData = colData, 
                              design = ~ time + condition + time:condition)"

If I wanted to identify DEGs across these time points between the WT & mutant, would I then use the

 results(dds, contrast = list('timeNight.conditionWT')),

where from what my undestanding of this interaction term is the "timing effect different across genotypes"

DESEQ • 1.8k views

ADD COMMENT • link updated 11 weeks ago by mazegriff ▴ 100 • written 3 months ago by Megan ▴ 50

3

Entering edit mode

Yes, this interaction term would capture genes where the difference between day and night is altered as a consequence of genotype. Note though that depending on your question you might need an additional filtering for the baseline of gene expression. Interactions can be tricky, for example the difference between day and night might be much bigger (fold-change wise) in WT than KO, but the overall expression level of the gene might be much higher or lower in this genotype. Depending on your question you might want to filter genes genes being (or not) differentially-expression as a consequence of genotype alone, or check for meaningful DEG patterns by subjecting the genes to hierarchical clustering and heatmap visualization. The interaction alone only cares about the fold change, not the overall expression level difference between genotypes.

ADD REPLY • link 3 months ago by ATpoint 86k

0

Entering edit mode

So, would it be a good idea then to look at genes in the KO between the two different time points to see a difference exists?

ADD REPLY • link 3 months ago by Megan ▴ 50

1

Entering edit mode

I'm not sue if this is super informative, but you could try merging the two factors into a single factor:

colData$group <- paste0(colData$condition, "_", colData$time)

And then follow the paper below:

conversion to a single factor from A guide to creating design matrices for gene expression experiments

ADD REPLY • link 12 weeks ago by jkim ▴ 190

0

Entering edit mode

Interesting--this would be a similar and simpler method to analyze interaction (e.g., effect of time interacting with genotype). Here's an excerpt from DESeq2's documentation about this approach:

Using this design is similar to adding an interaction term, in that it models multiple condition effects which can be easily extracted with results. Suppose we have two factors genotype (with values I, II, and III) and condition (with values A and B), and we want to extract the condition effect specifically for each genotype. We could use the following approach to obtain, e.g. the condition effect for genotype I:

For Megan's use case, it could look like this:

dds$group <- factor(paste0(dds$time, ".", dds$genotype))
design(dds) <- ~ group
dds <- DESeq(dds)
resultsNames(dds)
results(dds, contrast=c("group", "Night.KO", "Night.WT"))

ADD REPLY • link 12 weeks ago by mazegriff ▴ 100

0

Entering edit mode

Yes, I find that much more intuitive. When we include an interaction term in the model, the interpretation of the coefficients is different from a model without an interaction term. I have no idea why, but that's what the vignettes states.

The key point to remember about designs with interaction terms is that, unlike for a design ~genotype + condition, where the condition effect represents the overall effect controlling for differences due to genotype, by adding genotype:condition, the main condition effect only represents the effect of condition for the reference level of genotype (I, or whichever level was defined by the user as the reference level). The interaction terms genotypeII.conditionB and genotypeIII.conditionB give the difference between the condition effect for a given genotype and the condition effect for the reference genotype.

ADD REPLY • link 12 weeks ago by jkim ▴ 190

0

Entering edit mode

Yes, if you want to compare one subgroup to a second subgroup, do it this way. You can make the same comparison with the interaction design, but it's way less easy to read. This way is clear.

Interactions are for when you want to look at all 4 groups together; find genes where the different genotype causes the day to night change to be different. (e.g a gene which doubles expression from day to night in the WT, but triples expression from day to night with the KO)

ADD REPLY • link 12 weeks ago by swbarnes2 14k

0

Entering edit mode

This is what I am trying to do above where i could find genes would shift between day and night because of the genotype alone.

ADD REPLY • link 12 weeks ago by Megan ▴ 50

score 1 · Answer 1 · 2024-10-01

Hi Megan,

Yes the statement "interaction term is the 'timing effect different across genotypes'" is accurate. With your time series analysis across two cohorts (genotypes), the interaction term would capture interaction of condition and time (i.e,. if gene expression over time differs between genotypes). If your baselines are WT (for genotype) and Day (for time), then the combined coefficients you may be interested in would be the following:

timeNight.conditionKO

This interaction effect captures the differentially expressed genes (DEGs) that are significantly different across time (Night relative to Day) between genotypes (KO relative to WT); in other words, these genes have significantly different expression from Day to Night in KO mice relative to Day to Night in WT mice (+ log2FC would be significantly up-regulated at Night vs Day in KO relative to Night vs Day in WT).

If you were interested in the main effect of time (Night relative to Day) in WT, the comparison would be something like this, given the baselines above:

results(dds, name="time_Night_vs_Day")

DESeq2 uses a generalized linear model (GLM) with coefficients/effects like main effects and interaction effects (listed in the design) to fit the data to generate predictions. For instance, if I had two treatment cohorts (Treatment factor, with two levels TreatmentA and TreatmentB) and three time points (Time factor, with three levels D0, D10, and D20), and I was interested in DGE across cohorts (tests for the main effect of condition on gene expression), time groups (tests for the main effect of time), and across time and condition (i.e., if the effect of time differs by condition), then the design formula may resemble this:

design = ~ time + condition + time:condition

time:condition models time-specific condition effects or how the effect of time varies between genotypes. This can determine if the effects of one factor (i.e., time) depends on the level of another factor (i.e., genotype).

Related to ATpoint's response, a significant interaction term here may capture DEGs in Day vs Night between WT and KO, not accounting for differences in baseline expression levels between genotypes. There are scenarios that could complicate interpreting an interaction effect if no filtering is done.

Related to your follow-up: yes it would be a good idea to examine the DEGs in the KO genotype between time points. Assuming the above baselines: identifying DEGs between time points in genotype KO samples, you can 1.) find how GEX is changed over time within the KO genotype group, and 2.) determine if the interaction of time and genotype is driven by changes in KO at baseline time by finding the intersection of DEGs between these test results. If it fits with your analysis and appears in exploratory analysis (like hierarchical clustering), you could remove the DEGs due to the main effect of genotype alone at baseline time (Day) when investigating interaction effects independent of baseline genotype differences. For instance: if your research objective is to determine DGE due exclusively to the interaction of time and genotype, then you may want to specifically identify DEGs due to genotype alone (those found in the below contrast) and then find the non-intersecting DEGs found with the interaction effect.

condition_KO_vs_WT

Also, it's recommended to remove genes below a minimal count level across a designated number of samples.

Hope this is helpful.

Regards, Maze

score 1 · Answer 2 · 2024-10-03

1

Entering edit mode

12 weeks ago

Megan ▴ 50

Hi Maze,

this does help a lot! I tried looking at the different constrasts and am a little concerned that there may be something wrong how I processed my data. I looked at the constrats between the WT vs. the KO and noticed that the KO gene was not included in the list of DEGs looking at genotype alone. Is that something to be concenred about?

ADD COMMENT • link 12 weeks ago by Megan ▴ 50

0

Entering edit mode

Hi Megan,

Good question. I'd agree and say this is an interesting observation that could likely cause issues with analysis:

The KO gene has low counts in WT and KO samples, so it was filtered out if you filtered for genes with <10 counts across n samples. Perhaps look at the raw counts for this KO gene to confirm if it's lowly expressed. If it is, you could modify the filtering to retain it.
It could be highly variable in expression in WT, so the model's estimated variance parameter could be high and reduce statistical power to detect a significant difference in expression of the gene between WT and KO groups. If the gene is heteroskedastic (high variance in the WT group, low variance in KO group), increasing the number of replicates could help improve statistical detection. You could determine the homogeneity of the WT group via hierarchical clustering and/or PCA (plotting the first two principal components) to determine outliers, and then identify if it's the KO gene contributing to the behavior of these outlier samples. There are likely techniques for mitigating the effect of this kind of overdispersion.
Upstream step like alignment may've mismapped to the KO gene, or not been counted properly.

Let me know what exploratory analysis of WT and KO samples shows in terms of the count for the KO gene, and what kind of signal could be affecting the differential analysis and statistical significance.

Maze

ADD REPLY • link 12 weeks ago by mazegriff ▴ 100

score 1 · Answer 3 · 2024-10-04

1

Entering edit mode

12 weeks ago

Megan ▴ 50

It looks like there was an issue with the colData, where the KO was labeled as the WT instead. I corrected it and the fmr1 account is corrected (which is good).

ADD COMMENT • link 12 weeks ago by Megan ▴ 50

score 0 · Answer 4 · 2024-10-04

0

Entering edit mode

12 weeks ago

Megan ▴ 50

Hi Maze,

I had to use this account to response to you. Sorry about that. I made a PCA plot with both samples and this is what I got. I am going to check these samples for heteroskedasticity and see if there is a way to adjust it. It looks to me that some of the groups are too similar to each other. These are all normalized reads after doing the DESEQ. enter image description here

ADD COMMENT • link 12 weeks ago by Megan ▴ 50

1

Entering edit mode

You should add condition to this plot. If the subclusters match with condition, this isn't so bad.

Though only 10 samples? 5 v 5 would have been good to measure day/night differences. It might not be enough to do day/night and genotype at the same time.

ADD REPLY • link 12 weeks ago by swbarnes2 14k

1

Entering edit mode

Here is an updated photo. I was given this data from my PI, where there were three biological replicates each for the samples. I had to remove one of the WT from one of the groups since there was a big outlier that was affecting the data.

enter image description here

ADD REPLY • link 12 weeks ago by Megan ▴ 50

1

Entering edit mode

Hi Megan,

Excuse the later response.

The PCA plot you have with genotype and time specified shows how data points contribute to overall variance. Clustering of replicates is indicative of a strong signal or pattern in your data related to the specific group.

With almost half of the total variability in the data captured along the x-axis, any placement along this PC indicates a level of contribution to this strong pattern. Clustering along this axis indicates homogeneity like in KO.Night, KO.Day and WT.Night. KO.Night appears to drive the majority of variation (with all replicates clustering around -4). The WT.Day samples here do not indicate a strongly homogenous reference level (reference for time being Day, and for genotype being WT) in terms of PC1. Since there are only two replicates, it may be difficult to determine a consistent signal as they do not cluster near an integer on the x-axis which would indicate a similar level of contribution.The WT.Day may have within-group heterogeneity that could confound analysis with the other groups. On the y-axis, they both show a relatively stronger secondary signal that makes them similar (e.g., perhaps a characteristic that unites both samples of cells) and appears to drive much of the variation along PC2. Regarding the apparent clustering of KO.Day and WT.Night along PC1 and at or below 0 along PC2, this could indicate a similar GEX pattern as shown by their similar contribution to the primary source of variation/pattern in the data. There is perhaps not a significant difference in GEX between these two conditions. KO.Day and WT.Day are on opposite sides of 0 on the x-axis, indicating a potentially strong genotype effect (KO vs WT) reflected by PC1.The samples that align on the far extremes of the y-axis like WT.Day (strongly positive) align with PC2 and drive this secondary source of variation in the data.

Overall, these are some observations about how to potentially interpret this plot in terms of how much does each condition contribute to the primary and secondary sources of variations or patterns in the data.

I echo swbarnes2's point regarding sample size. For the WT.Day and WT.Night conditions, having only 2 replicates each is not ideal (recommending >=3), perhaps leading to a poor estimation of the dispersion parameter in the model (which can be heavily affected by heteroskedasticity and can, as average count increases for a gene, lead to higher residuals or difference between the model's prediction and the actual observed count for each gene, thus inaccurate expected counts predicted for each gene, log2FC, and subsequently statistically significant genes or DEGs). Thus the PCA analysis may not be as robust since the variability estimates within each condition are less reliable. How does the following plot look for the data?

meanSdPlot(assay(vsd))

Look forward to your thoughts.

Maze

ADD REPLY • link 11 weeks ago by mazegriff ▴ 100