I'm reading literature on m6A-seq analysis and came across a step I don't fully understand.
Why, after aligning reads to the reference genome, is it common practice to run the MarkDuplicates module from the Picard toolkit on the resulting BAM files to remove duplicates?
Is this step necessary, and what exactly does it accomplish? In particular, what is the rationale for removing PCR duplicates (i.e., reads arising from PCR amplification of the same molecule rather than from independent RNA fragments)?
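To make the question concrete, here is a simplified sketch of the kind of criterion MarkDuplicates applies: reads that map to the same chromosome, same 5' position, and same strand are treated as likely copies of one original fragment, and all but one are flagged. (This is a toy illustration, not Picard's actual implementation, which also accounts for mate coordinates in paired-end data, base qualities when choosing the representative read, and optical duplicates.)

```python
def mark_duplicates(reads):
    """Toy duplicate marking.

    reads: list of (chrom, pos, strand) tuples, where pos stands in for
    the 5'-most mapped coordinate. Returns a list of booleans: True means
    the read is flagged as a duplicate of an earlier read.
    """
    seen = set()
    flags = []
    for read in reads:
        if read in seen:
            flags.append(True)   # same coordinates as an earlier read -> duplicate
        else:
            seen.add(read)
            flags.append(False)  # first read at these coordinates is kept
    return flags

reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # likely PCR duplicate of the read above
    ("chr1", 100, "-"),  # same position, opposite strand: not a duplicate
    ("chr2", 500, "+"),
]
print(mark_duplicates(reads))  # [False, True, False, False]
```

The point of flagging these reads is that PCR copies inflate apparent coverage at a position without adding independent evidence, which can distort downstream peak calling and quantification.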
Here's an excerpt from the paper:
Reads that passed quality control were trimmed of adaptors using Trim Galore! v0.4.0 and aligned to mm10 using bowtie2 v2.2.5. Unmapped reads were then mapped to hg19. SAM files were converted to BAM files, sorted, and indexed with samtools v1.9. BAM files were deduplicated using MarkDuplicates (picard v2.18.14). Peaks were called with MACS2 (using IP samples and their input as controls) with the options --gsize 3.0e9 -q 0.05 --nomodel --broad and annotated by intersecting center positions with RefSeq annotations. The most upstream and downstream annotated TSS and TES, respectively, were considered for each gene. Peak analysis was performed using DiffBind v3.0.15 with score=DBA_SCORE_TMM_READS_FULL_CPM. Normalization was done with edgeR, using the ratio of mm10/hg19 reads (relative to respective input sample). For TSS analysis, a 1kb window surrounding the TSS of every RefSeq gene was used. Bigwig files were generated with Deeptools v3.3.0 using --scaleFactor for normalisation and visualized in Integrated Genome Viewer (IGV v2.9.4).
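For what it's worth, one plausible reading of the normalization sentence ("the ratio of mm10/hg19 reads, relative to respective input sample") is a spike-in scale factor computed per sample and then normalized to the matched input. The paper doesn't spell out the formula, so the function below is an assumption on my part, with made-up read counts, just to make that interpretation explicit:

```python
def spikein_scale_factor(ip_mm10, ip_hg19, input_mm10, input_hg19):
    """Hypothetical spike-in normalization (my interpretation, not the
    paper's stated formula): the mm10/hg19 read ratio in the IP sample,
    divided by the same ratio in its matched input.
    """
    return (ip_mm10 / ip_hg19) / (input_mm10 / input_hg19)

# Illustrative counts only: IP has 8M mm10 / 2M hg19 reads (ratio 4.0),
# input has 9M mm10 / 3M hg19 reads (ratio 3.0).
print(spikein_scale_factor(8_000_000, 2_000_000, 9_000_000, 3_000_000))
# 4.0 / 3.0 = 1.333...
```

If this reading is right, the duplicate-removal step matters here too: PCR duplicates would inflate both the mm10 and hg19 counts unevenly and bias the ratio.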