Question

Help understanding the meaning of variables in DESeq design matrix

6.9 years ago

Kristin Muench ▴ 640

Hello,

I have a dataset with a variety of samples that vary in Age (Young/Old) and Sex (M/F).

I'm interested in testing a few hypotheses, including (Q1) "What genes are DE as a product of Sex?" and (Q2) "What genes are DE as a result of the interaction of Age and Sex?"

To answer Q1, I originally imported data like so:

# Attempt 1
myData <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable_forAllMySamples,
                                     directory = pathToHTSeq,
                                     design = ~Sex)
dds <- DESeq(myData)

This produced a very large DE gene list.

Later, I redid this analysis with a different design matrix including interaction and contrasts, like so:

# Attempt 2
myData <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable_forAllMySamples,
                                     directory = pathToHTSeq,
                                     design = ~Sex+Age+Sex:Age)
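dds <- DESeq(myData)
# results() takes the argument `contrast` (singular): c(factor, numerator level, denominator level)
res <- results(dds, contrast = c('Sex', 'M', 'F'))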

However, this produced a MUCH smaller list of DE genes.

My understanding is that pulling out this contrast should test for the main effect of Sex in my dataset (so, Sex effects regardless of Age).

I had expected that would be the same as if I just made design matrix ~Sex, but it looks like that isn't the case. Why is that?

Is that because Attempt 2's design matrix "controls" for Age and any interaction effects, but Attempt 1 does not? Can anyone help me understand a bit better what is being tested in Attempt 1, or point me towards resources to strengthen my understanding of what that was doing?

Possibly relevant: When I PCA plotted my rlog-normalized data, the data clustered very well by Sex, and less well by Age.

Thank you very much for your help!

DESeq2 RNA-Seq R • 2.3k views

score 2 · Accepted Answer · 2018-08-15

6.9 years ago

Devon Ryan 105k

Things like slightly imbalanced group sizes (in this case, the numbers of males and females at each age), together with the fact that power increases with sample size, are the prime causes of this. I should note that the results with just ~Sex as the design likely contain more false positives, since they don't account for the confounder of age (as you astutely surmised).

It's pretty common for samples to cluster strongly by sex; its effect isn't as variable as that of something like age.
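
To make concrete what the Sex coefficient means under each design, here is a minimal base-R sketch. It uses lm() on a single made-up, noise-free "gene" rather than DESeq2 itself, so it only illustrates how the coefficients are defined, not how DESeq2 fits or tests them. The sample layout, the reference levels (F and Old), and the expression values are all assumptions chosen for illustration, not the poster's data.

```r
# Hypothetical balanced 2x2 layout, two samples per cell (not the poster's data).
samples <- expand.grid(rep = 1:2, Sex = c("F", "M"), Age = c("Old", "Young"))
samples$Sex <- factor(samples$Sex, levels = c("F", "M"))        # reference level: F
samples$Age <- factor(samples$Age, levels = c("Old", "Young"))  # reference level: Old

# Made-up, noise-free log2 expression for one gene:
# no sex difference in Old samples, a difference of 4 in Young samples.
cell_mean <- c("F.Old" = 5, "M.Old" = 5, "F.Young" = 5, "M.Young" = 9)
samples$y <- unname(cell_mean[paste(samples$Sex, samples$Age, sep = ".")])

# Columns of each design matrix, i.e. the coefficients that get fitted.
colnames(model.matrix(~ Sex, data = samples))
colnames(model.matrix(~ Sex + Age + Sex:Age, data = samples))

coef_sex_only    <- coef(lm(y ~ Sex, data = samples))
coef_interaction <- coef(lm(y ~ Sex + Age + Sex:Age, data = samples))

# ~Sex: M vs F averaged over both ages.
round(coef_sex_only["SexM"], 6)
# ~Sex + Age + Sex:Age: M vs F within the reference age (Old) only.
round(coef_interaction["SexM"], 6)
# The extra M vs F difference seen in Young samples sits in the interaction term.
round(coef_interaction["SexM:AgeYoung"], 6)
```

With these made-up values the three coefficients come out as 2, 0, and 4. In other words, once an interaction term is in the design, the Sex coefficient is no longer an effect "regardless of Age": it is the M vs F difference at the reference level of Age, and a gene whose sex difference only shows up in the other age group would not be picked up by that contrast alone. This is one more reason, on top of the points above, that the two gene lists can differ.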