DESeq2: adjusted p-values become NA when sub-setting samples
8 weeks ago
txema.heredia ▴ 200

Hi,

I have a bulk RNA-seq dataset from mouse, with 2 genetic backgrounds (WT and mutant (Dmin)) that are a model of a human disease. For each background I have:

  • Young Control (6wk, AL)
  • Adult Control (22wk, AL)
  • Adult Treatment (22wk, DR)

4-5 samples per group.

Although this data comes from a single sequencing experiment, the analysis was split into 2 papers led by different people (there is a lot of additional wet-lab work in each of them; the RNA-seq is just a fraction of the analyses). Because of this, the first analysis was done using only the Young vs Adult (Control) samples (excluding the Adult Treated samples from the DESeq2 object to prevent their interference and to keep a balanced Young/Adult ratio) and has already been published.

The second analysis focuses on the Treatment effect in Adults, but builds on the results from the first analysis. It was decided that, similarly to the first analysis, we would exclude the Young samples here and limit the DESeq2 object to only the Adult ones (both Treated and Control). However, an important part of the analysis compares the DEGs that changed in Mutant mice as they grew older (data generated ignoring adult+treatment mice) with how the treatment in adults affects some of those genes (data generated ignoring young mice). You can see where I'm going...

I have a couple of very interesting, biologically relevant genes. If I analyze the data using all the samples (from now on called "ALL"), both are significant. If I analyze the data using only the adult samples (called "ADULT"), both genes have padj = NA. I am focusing mostly on the comparison 22DminDR vs 22DminAL (Adult Mutant samples, Treated vs Control).

This is what the normalized counts look like:

Normalized counts

The dashed line is the baseMean in ADULT; the solid line is the baseMean in ALL. The color of the boxplots indicates the sample's age; the ADULT analysis uses only the blue ones. The groups are significantly different under a naive, simple t-test.
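For reference, that naive check can be reproduced directly from the normalized counts. A minimal sketch, assuming the dds_adult object and the md sample table used further down:

# naive per-gene t-test on normalized counts, Treated vs Control adult mutants (sketch)
nc <- counts(dds_adult, normalized = TRUE)
ctrl <- md[md$modelvar == "22DminAL", ]$sample
trt  <- md[md$modelvar == "22DminDR", ]$sample
t.test(nc["Spp1", trt], nc["Spp1", ctrl])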

And these are the results of DESeq2 for the 22DminDR vs 22DminAL comparison:

ALL samples:

> res_all[c("Spp1","Cd68"),]
log2 fold change (MAP): modelvar 22DminDR vs 22DminAL 
Wald test p-value: modelvar 22DminDR vs 22DminAL 
DataFrame with 2 rows and 5 columns
      baseMean log2FoldChange     lfcSE      pvalue        padj
     <numeric>      <numeric> <numeric>   <numeric>   <numeric>
Spp1   35.4312       -2.72427  0.473436 2.76300e-10 1.79236e-06
Cd68   39.0621       -1.08386  0.320647 2.81564e-05 4.35931e-03

ADULT samples:

> res_adult[c("Spp1","Cd68"),]
log2 fold change (MAP): modelvar 22DminDR vs 22DminAL 
Wald test p-value: modelvar 22DminDR vs 22DminAL 
DataFrame with 2 rows and 5 columns
      baseMean log2FoldChange     lfcSE      pvalue      padj
     <numeric>      <numeric> <numeric>   <numeric> <numeric>
Spp1   48.1062       -2.80547  0.336887 1.39669e-18        NA
Cd68   44.6694       -1.03332  0.305645 2.64104e-05        NA

I have been reading about this from several sources:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pvaluesNA

Note on p-values set to NA: some values in the results table can be set to NA for one of the following reasons:

  • If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p value and adjusted p value will all be set to NA.
  • If a row contains a sample with an extreme count outlier then the p value and adjusted p value will be set to NA. These outlier counts are detected by Cook’s distance. Customization of this outlier filtering and description of functionality for replacement of outlier counts and refitting is described below
  • If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA. Description and customization of independent filtering is described below

And following a few links,

https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilt

https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilttheory

https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#approach-to-count-outliers

https://support.bioconductor.org/p/92428/
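Putting those criteria into code, the three cases can be told apart for a single gene. A quick sketch using the res_adult object shown further down (since the p-value itself is not NA for my genes, only the third case can apply):

# which of the three NA cases applies to a given gene? (sketch)
g <- "Spp1"
res_adult[g, "baseMean"] == 0                                   # case 1: all samples have zero counts
is.na(res_adult[g, "pvalue"])                                   # case 2: Cook's outlier (p-value is NA as well)
res_adult[g, "baseMean"] < metadata(res_adult)$filterThreshold  # case 3: independent filtering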

I have been checking all these values for both comparisons:

  • Cook's distance:

ADULT samples:

Control (Adult, Mutant)

> assays(dds_adult)[["cooks"]][c("Spp1","Cd68"), md[md$modelvar=="22DminAL",]$sample]
            005        013          004         026
Spp1 0.16294652 0.08470974 5.871258e-01 0.003260566
Cd68 0.09220205 0.03160875 6.519192e-05 0.025208679

Treatment (Adult, Mutant)

> assays(dds_adult)[["cooks"]][c("Spp1","Cd68"), md[md$modelvar=="22DminDR",]$sample]
             006          014        029        016          022
Spp1 0.097962191 0.0001170788 0.03350797 0.11415755 0.3792535252
Cd68 0.008037438 0.0421175078 0.04893443 0.08676744 0.0006556893

ALL samples:

Control (Adult, Mutant)

 > assays(dds_all)[["cooks"]][c("Spp1","Cd68"), md[md$modelvar=="22DminAL",]$sample]
            005        013         004         026
Spp1 0.05697454 0.04023993 0.250256327 0.002499021
Cd68 0.06793757 0.02701774 0.001844374 0.023258608

Treatment (Adult, Mutant)

> assays(dds_all)[["cooks"]][c("Spp1","Cd68"), md[md$modelvar=="22DminDR",]$sample]
             006          014        029        016         022
Spp1 0.052105796 1.310099e-05 0.02087917 0.06202494 0.165106367
Cd68 0.001446295 1.449497e-02 0.03282256 0.03355575 0.005948603

I don't think the Cook's distances are large enough to justify removing these genes from the analysis.
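For context, the default cooksCutoff used by results() is the 0.99 quantile of an F(p, m - p) distribution, with p the number of model coefficients and m the number of samples. A quick sketch for the ADULT object:

# default Cook's distance cutoff used by results() (sketch)
m <- ncol(dds_adult)                  # number of samples
p <- length(resultsNames(dds_adult))  # number of model coefficients
qf(0.99, df1 = p, df2 = m - p)

With this many samples and only a handful of coefficients, that cutoff should be well above any of the Cook's values printed above, so outlier flagging cannot be what sets padj to NA here (an outlier would also have set the p-value itself to NA).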


  • Independent Filtering

Number of rejections: (ADULT vs ALL)

# number of rejections vs. filter quantile, for the ALL object
plot(metadata(res_all)$filterNumRej, 
     type="b", ylab="number of rejections",
     xlab="quantiles of filter", main="ALL")
lines(metadata(res_all)$lo.fit, col="red")
abline(v=metadata(res_all)$filterTheta)
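and presumably the same code with res_adult for the ADULT figure, i.e.:

plot(metadata(res_adult)$filterNumRej,
     type="b", ylab="number of rejections",
     xlab="quantiles of filter", main="ADULT")
lines(metadata(res_adult)$lo.fit, col="red")
abline(v=metadata(res_adult)$filterTheta)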

Rejections Adult

Rejections All

I honestly don't know what I'm really looking at here, but both of these figures look wildly different from the example in the Bioconductor vignette.



  • Filter threshold

ADULT samples

> metadata(res_adult)$filterThreshold
75.5017% 
49.8858

# “natural” pval = NAs
> summary(res_adult[is.na(res_adult$pvalue),]$baseMean)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
  0.00000   0.00000   0.00000   0.08045   0.00000 133.13102 

# pval != NA, but padj = NA
> summary(res_adult[!is.na(res_adult$pvalue) & is.na(res_adult$padj),]$baseMean)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.01763  0.13640  0.73736  6.59878  6.21863 49.88463 

# padj != NA
> summary(res_adult[!is.na(res_adult$pvalue) & !is.na(res_adult$padj),]$baseMean)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   49.89    87.80   153.62   350.21   296.26 44713.67 

ALL samples

> metadata(res_all)$filterThreshold
61.8497% 
5.917652

# “natural” pval = NAs
> summary(res_all[is.na(res_all$pvalue),]$baseMean)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.00000  0.00000  0.00000  0.06754  0.00000 74.84557 

# pval != NA, but padj = NA
> summary(res_all[!is.na(res_all$pvalue) & is.na(res_all$padj),]$baseMean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01110 0.06538 0.23794 0.82457 0.97549 5.91727 

# padj != NA
> summary(res_all[!is.na(res_all$pvalue) & !is.na(res_all$padj),]$baseMean)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    5.92    30.31    80.68   231.81   196.71 42934.59 

I don't know what I should be expecting here, but having the baseMean filter threshold jump from 5.9 (ALL) to 49.8 (ADULT) looks like exactly the reason why these genes are being excluded from the analysis.
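A quick way to confirm that for these two genes specifically (a sketch; it assumes res_all and res_adult cover the same genes):

# genes with a p-value in ADULT whose padj is NA only because of filtering,
# but which keep a padj in ALL
shared <- intersect(rownames(res_adult), rownames(res_all))
lost_to_filter <- shared[
  !is.na(res_adult[shared, "pvalue"]) & is.na(res_adult[shared, "padj"]) &
  !is.na(res_all[shared, "padj"])
]
c("Spp1", "Cd68") %in% lost_to_filter
length(lost_to_filter)  # how many genes are affected overall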

It seems it is clearly driven by the independent filtering.

# p-value histograms split by the independent-filtering threshold, for the ALL object
use <- res_all$baseMean > metadata(res_all)$filterThreshold
h1 <- hist(res_all$pvalue[!use], breaks=0:50/50, plot=FALSE)
h2 <- hist(res_all$pvalue[use], breaks=0:50/50, plot=FALSE)
colori <- c(`do not pass`="khaki", `pass`="powderblue")
barplot(height = rbind(h1$counts, h2$counts), beside = FALSE,
        col = colori, space = 0, main = "ALL", ylab="frequency")
text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)),
     adj = c(0.5,1.7), xpd=NA)
legend("topleft", fill=rev(colori), legend=rev(names(colori)))

pass baseMean threshold Adult

pass baseMean threshold All

These are, again, wildly different from what is shown in the vignette:

pass baseMean threshold bioConductor

It seems clear that subsetting the samples (or not) changes not only the baseMean of each gene but, more importantly, the minimum baseMean threshold that independent filtering uses to remove genes from the analysis.

What am I really looking at here? What numbers should I be expecting? Is a baseMean filter of 49.8 reasonable, or is it extremely high? Should I keep using only the ADULT samples for the analysis, or should I convince my co-authors to use ALL the samples? (That might bias other interesting genes, though; I think that is in fact why we ended up choosing the ADULT subset.)

Or is there something weird in this data that made the independent filtering go crazy, so that I should simply ignore it and re-run the analysis with results(dds, independentFiltering=FALSE)? Or should I just leave it be and mention that these two genes are significantly up/down-regulated but that their dispersion is too high to give an accurate log2FC estimate?

Thanks in advance

R DEG RNA-seq DESeq2
ADD COMMENT

First off, thank you for posting such a detailed question, with clear information about your experimental setup and your actual concerns. I'll be interested to see what others say (including Mike Love, if he happens to see this).

As for solutions: yes, you can turn off independent filtering and/or use a different filtering approach for low-count genes, such as filterByExpr from edgeR, which gives you a bit more control over the process.
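For example, a minimal sketch of that approach, assuming the dds_adult object and the "modelvar" design variable from the question:

# pre-filter with edgeR's filterByExpr, then skip independent filtering
library(edgeR)
keep <- filterByExpr(counts(dds_adult), group = colData(dds_adult)$modelvar)
dds_filt <- DESeq(dds_adult[keep, ])
res_filt <- results(dds_filt, independentFiltering = FALSE)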

Alternatively, just keep all the samples in the object, since it sounds like you're ultimately going to be using all of them anyway.

ADD REPLY

While we discourage cross-posting, Mike Love does not stop by here often (he has not been here for 7 months as of now), so cross-posting on the Bioconductor support site may be the way to go. If Mike answers in one of the two places, please close the question out with a reference to the answer.

ADD REPLY
22 days ago
txema.heredia ▴ 200

I cross-posted the original question on the Bioconductor support site and it was answered there:

https://support.bioconductor.org/p/9160134/

One thing to do is just turn off independentFiltering altogether. If you know what are reasonable filters that maintain power by discarding genes with "too low" counts then go ahead and implement that with independentFiltering=FALSE. "Too low" is context dependent, so it's easier for an analyst to apply than for software to figure out. What this routine (from genefilter package) does is to compute too low based on power, but in some circumstances, like comparing across pairwise comparisons, it's more desirable to have consistent filtering.

In short, when "comparing comparisons" it might be smarter to disable independentFiltering and manually apply one homogeneous minimum-count threshold to all the comparisons.
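Concretely, something along these lines (a sketch; the design variable and contrast come from the results headers above, and the count cutoff is just an example that has to be chosen per dataset):

# one consistent pre-filter applied before every comparison, then no independent filtering
keep <- rowSums(counts(dds_adult) >= 10) >= 4    # example rule: >= 10 counts in at least 4 samples
dds_adult_filt <- DESeq(dds_adult[keep, ])
res_adult_filt <- results(dds_adult_filt,
                          contrast = c("modelvar", "22DminDR", "22DminAL"),
                          independentFiltering = FALSE)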

ADD COMMENT

I moved your comment to an answer since it contains information that addresses your original question. You can accept this answer (green check mark) to provide closure to this thread.

ADD REPLY
