PCR duplicate removal in single-end ATAC-seq data
6 months ago
sarahmanderni ▴ 120

Hi,

How important is it to remove PCR duplicates (using, for example, samtools or Picard) before MACS peak calling for SINGLE-end ATAC-seq data? I need to decide carefully, as the samples are patient-derived with few cells, so we did not have the usual number of cells required to load for sequencing (5,000 to 25,000 cells per sample). Now I am wondering whether there was too little material to sequence and we just ended up with PCR duplicates. The other issue, of course, is that this is single-end data. Thanks!

ATAC-seq pcr-duplicate • 610 views
6 months ago
ATpoint 86k

I think it is always important to remove duplicates before peak calling, as otherwise you might call a lot of "signal" that is actually just a pileup of PCR artifacts. Depending on the peak caller, the software might handle duplicates automatically (macs3, for example, can do that), but I personally always make a dedicated BAM file with duplicates removed (along with any other filtering criteria I find reasonable).
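To make the idea concrete for single-end data, here is a minimal sketch in plain Python. It assumes that two reads count as duplicates when they share chromosome, strand and 5' position, which is a simplification of what samtools and Picard do (those tools also account for soft clipping, among other things). The `Read` record and the function names are hypothetical and only serve the illustration; for real data use the established tools.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal alignment record (not a pysam/htslib type)."""
    chrom: str
    start: int   # 0-based leftmost aligned position
    end: int     # exclusive end of the alignment
    strand: str  # "+" or "-"
    mapq: int


def five_prime(read: Read) -> int:
    """5' end of the read: leftmost base on '+', rightmost base on '-'."""
    return read.start if read.strand == "+" else read.end - 1


def remove_duplicates(reads: Iterable[Read]) -> List[Read]:
    """Keep one read per (chrom, strand, 5' position), preferring higher MAPQ.

    Assumption: this key defines a single-end duplicate. Soft clipping and
    base qualities are ignored here.
    """
    best = {}
    for r in reads:
        key = (r.chrom, r.strand, five_prime(r))
        if key not in best or r.mapq > best[key].mapq:
            best[key] = r
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.strand))


def duplication_rate(reads: Iterable[Read]) -> float:
    """Fraction of reads that would be discarded as duplicates."""
    reads = list(reads)
    if not reads:
        return 0.0
    return 1.0 - len(remove_duplicates(reads)) / len(reads)
```

A very high duplication rate on a low-input sample would suggest low library complexity, which is the concern raised in the question.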

I understand that few cells might reduce quality somewhat, but in my hands (over the years we have been doing ATAC-seq) the assay is very robust against fluctuations in input material, as long as the cells were viable and the protocol was performed correctly. I definitely recommend against adding more noise to the analysis to compensate for experimental shortcomings. In the end that just accumulates uncertainty, which is already an issue with a suboptimal experimental setup. I don't think it helps.

The fact that it is single-end is unfortunate and frankly a bad design decision, as the typical ATAC-seq banding pattern (beyond the Bioanalyzer/TapeStation quality control) is a valuable QC metric, especially when input material is low and the experimental outcome is uncertain. Companies such as Novogene offer very cost-effective PE150bp sequencing these days, and at least in our hands we never found a provider that (for plain sequencing of this type) could beat that price, plus you get the full paired-end information.

My recommendation is to remove duplicates, then make a bigWig track and simply look at the data in IGV. If there is good separation between peaks and noise, it's fine. Otherwise, you will have to decide whether the data can give you anything. Feel free to post the tracks and I can give feedback. Another important QC metric is FRiP (fraction of reads in peaks); see the thread "FRIP score ATAC-seq".
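For reference, FRiP is simply the number of reads falling inside called peaks divided by the total number of reads. Below is a small sketch in plain Python, assuming each read is reduced to a single position (for example its 5' end) and peaks are 0-based half-open intervals. The function names are hypothetical; in practice one would compute this from the deduplicated BAM and the peak file with dedicated tools.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(peaks):
    """Merge overlapping peaks per chromosome into sorted start/end lists.

    peaks: iterable of (chrom, start, end), 0-based half-open intervals.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for s, e in intervals:
            if merged and s <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], e)
            else:
                merged.append([s, e])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def in_peak(index, chrom, pos):
    """True if position pos on chrom lies inside any merged peak."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]


def frip(read_positions, peaks):
    """Fraction of reads in peaks; read_positions is an iterable of (chrom, pos)."""
    index = build_index(peaks)
    total = hits = 0
    for chrom, pos in read_positions:
        total += 1
        hits += in_peak(index, chrom, pos)
    return hits / total if total else 0.0
```

Comparing this value across samples is usually more informative than judging a single number in isolation.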


Thank you so much for the very comprehensive and clear answer.


Adding to this, you can still pool all BAM files and call peaks on the pooled data. Then let the differential analysis (hopefully with a good n per group) do the actual "analysis" in terms of finding differences. Peak calling is (to me) really just a low-level step to define which parts of the genome are considered for the count matrix. Or you could even do a peak-less analysis with something like the csaw package (https://bioconductor.org/books/release/csawBook/) and then, for downstream analysis such as motif enrichment, take the regions (or here windows, which csaw uses) that are significantly different. Lots of options here, depending on how "good" the data are.
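To illustrate what a window-based count matrix looks like, here is a rough sketch in plain Python. This is not csaw's implementation or API, just the general idea under simple assumptions: reads are reduced to single positions, windows are fixed-width and non-overlapping, and the window width of 150 and the `min_total` filter are arbitrary, hypothetical choices.

```python
from collections import Counter


def window_counts(samples, width=150, min_total=1):
    """Count reads per fixed-width window for each sample.

    samples: dict mapping sample name -> iterable of (chrom, pos).
    Returns (windows, matrix), where windows is a sorted list of
    (chrom, start, end) and matrix maps sample name -> list of counts
    in the same window order. Windows whose total count across all
    samples is below min_total are dropped.
    """
    per_sample = {
        name: Counter((chrom, pos // width) for chrom, pos in reads)
        for name, reads in samples.items()
    }
    all_bins = sorted(set().union(*per_sample.values()))
    kept = [
        b for b in all_bins
        if sum(counts[b] for counts in per_sample.values()) >= min_total
    ]
    windows = [(chrom, i * width, (i + 1) * width) for chrom, i in kept]
    matrix = {name: [counts[b] for b in kept] for name, counts in per_sample.items()}
    return windows, matrix
```

The resulting matrix plays the same role as a peak-based count matrix: it is the input to the differential analysis, which then decides which regions actually differ between groups.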
