As there is no perfect answer but only homemade solutions, here is mine according to what I'm interested in.
So, as suggested by ATpoint, I need to find a "number of mapped reads on a given area" threshold to limit the call of false positive peaks (bin with high fold change but low mapped reads). With this threshold I should be able to detect which bin may disturb the analysis and remove from bam files the reads involved in those bins.
I started with the raw bam files, I will take one mark for this example, let's say H2AZ like in my post, I have 4 files :
- aB_H2AZ.bam
- rB_H2AZ.bam
- aB_input.bam
- rB_input.bam
Bam to bedgraph
First step get a bedgraph file with the reads count per bin. For this I used bamCoverage.
--binSize 100
: is a good compromise between file size (which could be very large for bin at 10, for example) and useful count information.
--region chr12
: I only focus on chromosome 12
--normalizeUsing RPGC
: Normalization method implement in bamCoverage which is number of reads per bin / (total number of mapped reads * fragment length) / effective genome size
--effectiveGenomeSize 2652783500
: In the RPGC formula we need the effective genome size which is 2652783500
for mm10
Note : By defaut bamCoverage generate bedgraph merging bins with the same read count values, in order to save space. As I wanted to compared read counts per bin I wanted to get the exact same number of bin with the same bin length in each file. Discussion with one of the author leads to uncomment some lines in writeBedGraph.py
https://github.com/deeptools/deepTools/blob/master/deeptools/writeBedGraph.py#L248-L256
Mean standard deviation plot
In order to catch potential false positive peaks comming from bins analysis, I created a meanSDplot under R, using DESeq2 functions, to visualize the log2(SD)
over thelog2(mean)
Create a merge file of the bedgraph from last step to create a single matrix in R
Note : Inspecting my meanSDplot, I chose a threshold of log(mean) = 3, to remove high log(SD)
with low log(mean)
? Means that every bins which less than 2^3 = 8 reads as mean (aB/rB) will be discarded. This is why there is a threshold
attribute in run_meanSDplot
function. First run the code without threshold, then when you think you've got one, fill it to get the best_bins
R code :
The script output 2 things :
- MeanSDplot to chose a threshold
- List of the "best" bins passing the threshold
High counts bam files
Create new bam file from the raw one with only reads falling in the "best" bin, using bedtools intersect
Peak calling
From here, the filtering part is over, just call your peaks using macs2 callpeak
macs2 callpeak -t aB_H2AZ.best.bam -c aB_input.bam -g mm --outdir peakcall --nomodel --extsize 147 --mfold 5 50 -B -n aB_H2AZ
macs2 callpeak -t rB_H2AZ.best.bam -c rB_input.bam -g mm --outdir peakcall --nomodel --extsize 147 --mfold 5 50 -B -n rB_H2AZ
Note : I used --nomodel
and --extsize 147
because macs2 struggle to build a model with few number of peaks (due to our filtering). The result will go to peakcall directory
Differential peaks
Last but not least, find peaks that are only present in active condition, resting condition or present in both.
We will use macs2 bdgdiff
which needs the number of mapped reads to apply a library size normalization, using samtools
#Number of mapped reads for raw bam file
for file in *.bam; do echo $file; samtools view -F 0x904 -c $file; done;
#aB_H2AZ.bam
#42559264
#rB_H2AZ.bam
#27689824
Then, run the command line :
macs2 bdgdiff --t1 peakcall/aB_H2AZ_treat_pileup.bdg --t2 peakcall/rB_H2AZ_treat_pileup.bdg --c1 peakcall/aB_H2AZ_control_lambda.bdg --c2 peakcall/rB_H2AZ_control_lambda.bdg --d1 42559264 --d2 27689824 --min-len 150 --max-gap 100 --outdir bdgdiff -o aB_H2AZ.diff.bdg rB_H2AZ.diff.bdg common_H2AZ.diff.bdg
Note : H2AZ is a sharped mark so no need to call the --broad
parameter here.
And, that is it, method looks not perfect as there is some a priori with the threshold value + there is no duplicates. But we managed to limit the effect of false positive peak calling due to some low counts, which is nice.
Assuming you do not have replicates to run a bayesian framework such as
csaw
, I would think about some prefiltering prior tobdgdiff
. Maybe use either a window-based approach (maybe half the average peak width) or the peak coordinates themselves to remove from the bedgraphs all entries when having enrichments or depletions larger than a certain threshold & a count in either of the conditions lower than a certain threshold. For the count threshold, I would probably inspect an MA plot to get an idea at which A (x axis) the M's become unreasonably large.