Hello everyone
I've recently spent a lot of time reading about different settings for peakcaling on PE ATAC-seq data, and despite seeing many answers I still feel like I lack general consensus on upsides and downsides each of the methods to handle the data and thus can't make a reasonable conclusion on how I should proceed.
I'm handling ATAC-seq PE data. Files trimmed, aligned, filtered (blacklist, duplicates) as per general recommendations for handling this type of data. Now seeing how MACS3 and Diffbind are both commonly utilised (and thus there are plenty of tutorials for both) for handling this type of data I went along with this combination as a newbie to ATAC-seq (while still keeping other tools like csaw in the back of my head).
Now I've seen a 2 general recommended ways to proceed with peakcalling with MACS3.
Utilize callpeak function with -bam parameter and different values of extsize and shift parameter combined with --nomodel parameter (decided to stick with --shift 75 and extsize 150 - double of shift). However this recommendation made me very confused as this way of handling data completly disregards second strand of the experiment (meaning my lab could save money and do single-end experiment against the recommendation for ATAC-seq, if we end up disregarding the reverse strand anyway)
Utilize callpeak function with BAMPE parameter - therefore utilize information from both strands with this parameter --shift can be no longer set to anything other than 0, therefore other parameters are not added. This seemed like a better approach for meas it utilizes information from both strands - therefore actually making a case for PE sequencing being worth here.
Now to compare the results I did both. Then used Diffbind with summits parameter set to 75.
Utilizing first approach I get 3-4 times more consensus peaks, and respectively more differentially accessible chromation regions. Moreover I can easily identify some of these new regions by just looking up the data from bamfiles in IGV, as some of these new regions are among the regions changing the most between conditions. Therefore well lets say this passes the eyetest at least.
Utilizing the second approach I already mentioned i got significantly less peaks. Moreover some of the peaks are astoundingly long - over 500 kb - whereas utilizing the first approach that kind of a peak gets divided into multiple shorter ones - and once again eyechecking this in IGV makes me think that the first approach in fact gives more sensible results. But I still feel wrong about discarding half of the reads, therefore making paired-end libraries obsolete.
Anybody with more experience in that matter willing to share his thoughts about both of these approaches?
You might also try MACS3's hmmratac tool. I haven't given it a try, but it may be a more straightforward "out of the box" solution.
...and it will take hours last time I tried. Given there hasn't been a GitHub commit in 6 years it might still be that slow.
Huh, good to know. I never even realized it was a command until recently.
Oh, it's part of macs now. I was referring to the standalone version from their paper. Ok, maybe try it then, they might have made notable improvements since then.