1- The DiffBind vignette (section 8.2 - Deriving consensus peaksets) states:
When performing an overlap analysis, it is often the case that the overlap criteria are set stringently in order to lower noise and drive down false positives. It is less clear that limiting the potential binding sites in this way is appropriate when focusing on affinity data, as the differential binding analysis method will identify only sites that are significantly differentially bound, even if operating on peaksets that include incorrectly identified sites.
Doesn't it make sense, then, to be permissive when setting consensus peaksets for affinity analysis? Practically speaking, this means that taking all peaks that appear in at least one sample is a good default strategy (unless there is a reason not to).
2- If one nevertheless decides to be more stringent, I think there is a large conceptual difference between peaks that are shared between two replicates of the same condition (but not across the conditions) and peaks that are shared between two samples of different conditions (but not across the replicates within the conditions).
In the first case, the fact that the peak exists in both replicates brings greater confidence in the peak call. In the second case, however, the fact that the same peak exists in two conditions does not suggest greater confidence that there is truly differential affinity at that site. If a peak is shared across two different conditions (but not between replicates within either condition), this suggests quite the opposite: that this is not a region with differential affinity. That is because the peak (a) exists in both conditions, which lessens the plausibility of it being differential, and (b) is not reproduced in the replicates of either condition, which increases the possibility that it is random noise.
Practically speaking, this means that a good default way to perform a stringent analysis is to define the peakset (for differential affinity) as the peaks that exist in at least two replicates of the same condition.
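To make the distinction concrete, here is a toy sketch of the three strategies discussed above. It is not how DiffBind works internally: real peaks are genomic intervals that have to be merged by overlap, whereas here each peak is just a hypothetical ID, and the sample data and function names are invented for illustration only.

```python
from collections import Counter

# Hypothetical toy data: each sample is a set of peak IDs, grouped by condition.
samples = {
    "treated": [{"p1", "p2", "p3"}, {"p1", "p2", "p4"}],
    "control": [{"p3", "p5"}, {"p4", "p5", "p6"}],
}

def union_peaks(samples):
    """Permissive: keep every peak called in at least one sample."""
    return set().union(*(peaks for reps in samples.values() for peaks in reps))

def any_two_samples(samples, min_overlap=2):
    """Keep peaks called in at least min_overlap samples, ignoring condition."""
    counts = Counter(p for reps in samples.values() for peaks in reps for p in peaks)
    return {p for p, n in counts.items() if n >= min_overlap}

def within_condition(samples, min_reps=2):
    """Keep peaks called in at least min_reps replicates of the same condition."""
    kept = set()
    for reps in samples.values():
        counts = Counter(p for peaks in reps for p in peaks)
        kept |= {p for p, n in counts.items() if n >= min_reps}
    return kept

print(sorted(union_peaks(samples)))       # ['p1', 'p2', 'p3', 'p4', 'p5', 'p6']
print(sorted(any_two_samples(samples)))   # ['p1', 'p2', 'p3', 'p4', 'p5']
print(sorted(within_condition(samples)))  # ['p1', 'p2', 'p5']
```

In this toy data, p3 and p4 are each seen once in "treated" and once in "control". A condition-blind "at least two samples" rule keeps them, while the within-condition rule drops them, which is exactly the case described in point 2.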
Do you think these points are valid?
My question was not how to do it, but conceptually: shouldn't this be the default way to do an analysis, unless there is a good reason not to?
The question of defaults for the software is often a tricky one.
I do think the multiple-replicate method is a good idea in many cases, which is why it is included in the vignette. However, there are also cases in multi-factor experiments where some sample groups are similar across one factor but different across another. The idea of the default being more permissive is to include regions where replicated enrichment has been detected, and to rely on the modeling to filter out the non-differential intervals. This raises the question of why not just include any and all peaks. The default requirement for some minimal replication filters out a lot of noise and reduces the number of tests subject to multiple testing correction, while imposing a lighter touch than requiring replication within each sample group.
There is also the historical aspect; this has been the default since DiffBind's debut in 2011, and there's a fairly high bar for changing defaults.