DESeq2 unequal sample sizes
1
1
Entering edit mode
7.3 years ago
glados ▴ 40

Hi,

I'm using DESeq2 to look for differentially expressed genes. I have three groups with different individuals: Treatment A, Treatment B, and Control.

Sample sizes (individuals) differ between the groups: A=9, B=5, Control=5.

If I want to test A directly against B, the sample size difference is of no problem, I understand that (asked already in this question: https://www.biostars.org/p/90421/).

But, when I want to test A vs Control and see how that differs from B vs Control, I worry about the unequal sample sizes. I expect the A vs Control comparison to find slightly more significant genes just because of more accurate dispersion.

How should I deal with this problem? Is 9 vs 5 not a big enough difference to worry about, or should I randomly subsample 5 individuals from treatment A (i.e. discard 4) and only use those in the comparison with the Controls?

Grateful for any advice.

R deseq deseq2 • 5.0k views
ADD COMMENT
5
Entering edit mode
7.3 years ago

Yes, when the numbers are unequal in this way, you will see many more genes passing your FDR Q value threshold. I faced this situation 2 years ago but for even more extreme/unequal sample distributions. I think that it's reasonable to use difference thresholds for each comparison.

Here is some code that allows you to visually check how FDR affects different comparisons by showing how many significant variables remain at each level of FDR - it also gives the nominal P-value equivalent at each FDR. in this way, you can choose a good cut-off.

options(scipen=999)

topT <- as.data.frame(AVsControl)
AVsControl.pvalues <- topT$pvalue[!is.na(topT$pvalue)]

#5% FDR
alpha <- 0.05
AVsControl.cutoff0.05 <- max(topT$pvalue[topT$padj<alpha], na.rm=TRUE)

#1% FDR
alpha <- 0.01
AVsControl.cutoff0.01 <- max(topT$pvalue[topT$padj<alpha], na.rm=TRUE)

et cetera...

pdf("RankPValue.pdf")
        par(mar=c(5,8,5,8), cex=0.8, cex.main=0.8, cex.lab=0.8, cex.axis=0.8, mfrow=c(1,2))

        #A versus Control
            orderInPlot <- order(AVsControl.pvalues)
            showInPlot <- (AVsControl.pvalues[orderInPlot] <= 0.05)
            alpha <- 0.05
            if (any(showInPlot))
            {
                plot(
                   seq(along=which(showInPlot)),
                   AVsControl.pvalues[orderInPlot][showInPlot],
                   type="l",
                   pch=".",
                   main=paste("Significance cut-off @ 5% FDR = \n", round(AVsControl.cutoff0.05,9)),
                   xlab=expression(paste("Rank ", italic("P"), " value (num. sig. genes)")),
                   ylab=expression(paste(italic("P"), " value")))
                abline(a=0, b=alpha/length(AVsControl.pvalues), col="red3", lwd=1)
            }

            orderInPlot <- order(AVsControl.pvalues)
            showInPlot <- (AVsControl.pvalues[orderInPlot] <= 0.01)
            alpha <- 0.01
            if (any(showInPlot))
            {
                plot(
                   seq(along=which(showInPlot)),
                   AVsControl.pvalues[orderInPlot][showInPlot],
                   type="l",
                   pch=".",
                   main=paste("Significance cut-off @ 1% FDR = \n", round(AVsControl.cutoff0.01,9)),
                   xlab=expression(paste("Rank ", italic("P"), " value (num. sig. genes)")),
                   ylab=expression(paste(italic("P"), " value")))
                abline(a=0, b=alpha/length(AVsControl.pvalues), col="red3", lwd=1)
            }
dev.off()

Screenshot

ADD COMMENT
0
Entering edit mode

Kevin, thank you so much for your reply and code! Very appreciated. I did an analysis using only 5 subsampled individuals in groupA and I see about 40% fewer significant genes so it does indeed have a dramatic effect on the results.

I had not yet considered to use a stricter FDR threshold for groupA, though that makes a lot of sense. Thank you for suggesting it. That would allow me to keep all the individuals and the test will therefore become more accurate.

I did the plots based on your code, but have a slight problem understanding what they mean. Indeed, the red line cuts the curve at fewer genes for each lowered threshold (about 1500 genes at FDR 10%, 1000 at 5%, and about 500 at FDR 1%). But how does this inform me of what FDR to choose? Do I compare this somehow to the groupB results? Thank you for your advice.

ADD REPLY
0
Entering edit mode

Hey glados,

Yes apologies for not elaborating further. These plots assisted me because, with them, I was able to quickly generate dozens of plots at different FDR thresholds and determine the number of genes passing FDR at each threshold. This was useful because we wanted a roughly even number of genes from each comparison for further downstream analyses.

The plots also help in a manuscript situation whereby, if you represent your different cut-offs visually, it provides just that little bit of extra assurance to reviewers that you have thought very deeply about the sample imbalance and how to best extract differentially expressed genes from your comparisons.

ADD REPLY

Login before adding your answer.

Traffic: 1938 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6