To understand why these genes get called differential, you have to understand DESeq2's approach to estimating per-gene dispersion.
The raison d'être of DESeq2 and edgeR is to find accurate estimates of each gene's dispersion with only a small number of replicates. They do this by sharing information between genes, on the understanding that genes with a similar expression level will, on average, have a similar dispersion. They therefore estimate a dispersion for each gene and fit a smoothed curve to the mean-expression vs dispersion estimates; this curve represents the expected dispersion at a given expression level. Then, for each gene, they "shrink" the gene-wise dispersion estimate towards the expected dispersion using a Bayesian procedure. This means that genes with a small dispersion will have their dispersion increased and are therefore less likely to be called significant, while genes with a large dispersion, like the ones here, will have their dispersion decreased, making it more likely they will be called significant.
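You can see this shrinkage directly in DESeq2. A minimal sketch below (the input objects counts_mat and col_data are hypothetical placeholders; the DESeq2 functions and metadata column names are as documented in recent releases, but verify against your version):

    library(DESeq2)

    # Hypothetical inputs: a raw count matrix and a data.frame with a
    # 'condition' column giving the group of each sample.
    dds <- DESeqDataSetFromMatrix(countData = counts_mat,
                                  colData   = col_data,
                                  design    = ~ condition)
    dds <- DESeq(dds)

    # Gene-wise estimates (black), fitted trend (red line) and final
    # shrunken values (blue) in one plot.
    plotDispEsts(dds)

    # The per-gene values are also stored in the row metadata:
    #   dispGeneEst - gene-wise (unshrunken) estimate
    #   dispFit     - value of the fitted mean-dispersion trend
    #   dispersion  - final (shrunken) estimate used for testing
    head(mcols(dds)[, c("dispGeneEst", "dispFit", "dispersion")])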
In their paper, Love et al. state:
If, on the other hand, an individual gene’s dispersion is far above the
distribution of the gene-wise dispersion estimates of other genes,
then the shrinkage would lead to a greatly reduced final estimate of
dispersion. We reasoned that in many cases, the reason for
extraordinarily high dispersion of a gene is that it does not obey our
modeling assumptions; some genes may show much higher variability than
others for biological or technical reasons, even though they have the
same average expression levels. In these cases, inference based on the
shrunken dispersion estimates could lead to undesirable false positive
calls. DESeq2 handles these cases by using the gene-wise estimate
instead of the shrunken estimate when the former is more than 2
residual standard deviations above the curve.
My guess is that this safety valve is "failing" in your case. This is probably because there is a large standard deviation in the distances of the gene-wise dispersion estimates from the fitted trend (due to only having two replicates), so the large dispersions seen here fall within 2 standard deviations of the expected curve and the shrunken dispersion value is used.
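You can check whether this is what happened by looking at the dispersion-outlier flag DESeq2 stores per gene. A rough sketch, reusing the dds object from above (column names as stored by current DESeq2 releases; treat them as an assumption and confirm with colnames(mcols(dds))):

    disp_info <- as.data.frame(mcols(dds)[, c("dispGeneEst", "dispFit",
                                              "dispersion", "dispOutlier")])

    # Genes flagged as dispersion outliers keep their gene-wise estimate
    # instead of the shrunken one (the ">2 residual SD" rule from the paper).
    table(disp_info$dispOutlier)

    # For unflagged genes with a high gene-wise dispersion, compare the
    # gene-wise estimate to the shrunken value actually used in testing.
    high_disp <- subset(disp_info, !dispOutlier & dispGeneEst > dispFit)
    head(high_disp[order(-high_disp$dispGeneEst), ])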
If you want to avoid this, you could use DESeq (rather than DESeq2) and set sharingMode="maximum". Here the maximum of the fitted trend and the per-gene dispersion estimate is used, which is a more conservative approach (DESeq was considerably more conservative than other methods of its time).
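A rough sketch of that workflow with the legacy DESeq package (argument names follow the old DESeq vignette, but check them against the version you install; counts_mat, conditions and the condition labels "control"/"treated" are hypothetical):

    library(DESeq)

    # Hypothetical inputs: a raw count matrix and a factor of group labels,
    # one per column of the matrix.
    cds <- newCountDataSet(counts_mat, conditions)
    cds <- estimateSizeFactors(cds)

    # sharingMode = "maximum" takes the larger of the fitted trend and the
    # per-gene estimate, so highly variable genes are never shrunk downwards.
    cds <- estimateDispersions(cds, sharingMode = "maximum")

    res <- nbinomTest(cds, "control", "treated")
    head(res[order(res$padj), ])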
See also How to add images to a Biostars post
With only two samples, any comparison you make is entirely invalid and all the time you spend on it is wasted. You have no means to figure out whether the variance you observe is biologically relevant variance, biological noise, or technical artefacts.
The absolute minimum number of replicates per group is 3, but ideally, if budget permits, the quality of your data would be much better with 5 or more replicates per group.
Hi WouterDeCoster,
Thanks for showing me how to add figures to the main post. Very useful!
I quite disagree with as extreme a view as "any comparison with n=2 is entirely invalid and all time you spend is wasted". Often, we do an exploratory n=2 to figure out things like the most appropriate timepoint, dosage, etc. There are a couple of hundred genes that show excellent concordance at n=2 and significant differential expression (and indeed are biologically meaningful), and to me that is a valid and useful takeaway from the n=2 experiment. Of course, there is no doubt that n>>3 will improve all measurements.
Yes, one cannot infer anything from data points showing extreme variance when n=2, but my question was more about why such data points are not being flagged in some way with a high p-value, but are in fact assigned p-values as low as those of well-replicated data points.