Question

ATAC-seq paired-end data peakcalling on MACS3 - BAMPE vs (BAM and shift with extsize)

0

Entering edit mode

5 months ago

Manko47 ▴ 10

Hello everyone

I've recently spent a lot of time reading about different settings for peakcaling on PE ATAC-seq data, and despite seeing many answers I still feel like I lack general consensus on upsides and downsides each of the methods to handle the data and thus can't make a reasonable conclusion on how I should proceed.

I'm handling ATAC-seq PE data. Files trimmed, aligned, filtered (blacklist, duplicates) as per general recommendations for handling this type of data. Now seeing how MACS3 and Diffbind are both commonly utilised (and thus there are plenty of tutorials for both) for handling this type of data I went along with this combination as a newbie to ATAC-seq (while still keeping other tools like csaw in the back of my head).

Now I've seen a 2 general recommended ways to proceed with peakcalling with MACS3.

Utilize callpeak function with -bam parameter and different values of extsize and shift parameter combined with --nomodel parameter (decided to stick with --shift 75 and extsize 150 - double of shift). However this recommendation made me very confused as this way of handling data completly disregards second strand of the experiment (meaning my lab could save money and do single-end experiment against the recommendation for ATAC-seq, if we end up disregarding the reverse strand anyway)
Utilize callpeak function with BAMPE parameter - therefore utilize information from both strands with this parameter --shift can be no longer set to anything other than 0, therefore other parameters are not added. This seemed like a better approach for meas it utilizes information from both strands - therefore actually making a case for PE sequencing being worth here.

Now to compare the results I did both. Then used Diffbind with summits parameter set to 75.

Utilizing first approach I get 3-4 times more consensus peaks, and respectively more differentially accessible chromation regions. Moreover I can easily identify some of these new regions by just looking up the data from bamfiles in IGV, as some of these new regions are among the regions changing the most between conditions. Therefore well lets say this passes the eyetest at least.

Utilizing the second approach I already mentioned i got significantly less peaks. Moreover some of the peaks are astoundingly long - over 500 kb - whereas utilizing the first approach that kind of a peak gets divided into multiple shorter ones - and once again eyechecking this in IGV makes me think that the first approach in fact gives more sensible results. But I still feel wrong about discarding half of the reads, therefore making paired-end libraries obsolete.

Anybody with more experience in that matter willing to share his thoughts about both of these approaches?

MACS3 BAMPE BAM Peakcalling ATAC-seq • 1.2k views

ADD COMMENT • link updated 5 months ago by ATpoint 88k • written 5 months ago by Manko47 ▴ 10

0

Entering edit mode

You might also try MACS3's hmmratac tool. I haven't given it a try, but it may be a more straightforward "out of the box" solution.

ADD REPLY • link 5 months ago by jared.andrews07 ★ 19k

0

Entering edit mode

...and it will take hours last time I tried. Given there hasn't been a GitHub commit in 6 years it might still be that slow.

ADD REPLY • link 5 months ago by ATpoint 88k

0

Entering edit mode

Huh, good to know. I never even realized it was a command until recently.

ADD REPLY • link 5 months ago by jared.andrews07 ★ 19k

1

Entering edit mode

Oh, it's part of macs now. I was referring to the standalone version from their paper. Ok, maybe try it then, they might have made notable improvements since then.

The orginal paper was published in 2019, and the original software was written in JAVA, by the then PhD student Evan Tarbell, a mohawk bioinformatician. In MACS3 project, we implemented HMMRATAC idea in Python/Cython and optimize the whole process using existing MACS functions and hmmlearn.

ADD REPLY • link 5 months ago by ATpoint 88k

score 5 · Accepted Answer · 2025-02-11

Now let me toss my personal best practice: The thing with ATAC-seq is that indeed it is paired-end but actually the location where the transposase integrated into the genome is exactly the end of two reads, so technically, despite a fragment comes from a paired-end read, the two ends have actually nothing to do with each other, they were cut by two different transposases. That having said, I personally take the beginning of every R1 and R2, reduce it to 1bp, save as BED file and then have macs read this BED and use --keep-dup=all --nomodel --extsize 100 --shift -50 --min-length 250. That is similar to what you describe in 1) but does not throw away R2.

Now, do I have any benchmark data that this is good/better/worse than anything else? No, I don't. It's something that in my head makes sense, and given there is no gold standard or consensus anyway, I just made my own. Sometimes I use the Genrich peak caller as well. Works well in my hands most of the time.

I would not use 2) since this imho inflates peak width.