fastp dedup: identical reads only?
19 months ago
Darked89 4.7k

Hello,

I have some sequencing data (ChIP-seq, single-end, 50-75 bp) with at times high duplication rates. Since the duplicates will not be used for peak calling downstream, I wonder whether it would be better to remove them before mapping, saving some disk space and cluster time.

To filter out duplicates I want to use fastp with the default dup_calc_accuracy level (3).
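
Something like this is what I have in mind (a sketch with placeholder file names; single-end input):

    # FASTQ-level deduplication; accuracy level 3 is the default when --dedup is set
    fastp --in1 sample.fastq.gz --out1 sample.dedup.fastq.gz \
        --dedup --dup_calc_accuracy 3 \
        --json sample.fastp.json --html sample.fastp.html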

My questions:

  1. Is such deduplication at the FASTQ level a sensible step in ChIP-seq?
  2. Are there any noticeable benefits to using higher dup_calc_accuracy levels?

Many thanks for your help / opinions

DK

fastq chip-seq

You could also use clumpify.sh from the BBMap suite and keep the count of deduplicated reads in the FASTQ headers. Do you have such excessive duplication that you are worried about this?
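
A sketch of what I mean (flag names as I recall them from the BBMap docs; subs=0 restricts matching to strictly identical sequences, and addcount=t appends the copy count to the read name):

    # exact-duplicate removal, recording copy counts in the FASTQ headers
    clumpify.sh in=sample.fastq.gz out=sample.dedup.fastq.gz \
        dedupe=t subs=0 addcount=t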

If we use MACS2 do we need to remove duplicate sequences with samtools rmdup ?
The Duplicates Dilemma of ChIP-Seq


I have seen 40%+ at times. So if these are failed samples, I want them to fail early.

Also, I have multiple (2-3) runs per sample, so merging heavily duplicated runs, be it at the FASTQ or BAM stage, and only then discarding duplicates looks counterintuitive. I will run samtools markdup on the merged BAMs anyway.
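
Roughly this, at the BAM stage (a sketch; single-end data, so I skip the fixmate step that the paired-end markdup workflow needs):

    # merge runs, coordinate-sort, then mark duplicates on the merged BAM
    samtools merge -u - run1.bam run2.bam run3.bam \
        | samtools sort -u - \
        | samtools markdup - sample.markdup.bam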


40% by no means indicates a failed sample. The ChIP-ed protein can be not very abundant, or the antibody may just not be good; it can still be a valid sample. ChIP is tricky, so definitely look at the data in IGV before making a decision.


The high duplication rates are from patient histopathology paraffin blocks: little starting material plus histone marks. I try to use what is left (no duplicates, primary alignments only, MAPQ >= 15; see the sketch below), but MACS often throws in the towel, and rightly so, since the deepTools fingerprint plots look scary. SICER/epic2 still call a bunch of peaks; their quality is a topic for another thread.
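
The filter itself is along these lines (a sketch; -F 3328 = 0xD00 drops duplicate, secondary, and supplementary alignments):

    # keep primary, non-duplicate alignments with MAPQ >= 15
    samtools view -b -q 15 -F 3328 sample.markdup.bam > sample.filtered.bam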


Re clumpify.sh: I will run tests to make sure that fastp and clumpify do more or less the same thing. While I am a fan of clumpify, I am not sure it can produce an easy-to-parse run record the way fastp does.
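
For what it's worth, fastp's JSON report is straightforward to query; assuming the report layout of recent fastp versions, the overall duplication rate sits under a top-level duplication key:

    # pull the estimated duplication rate out of the fastp JSON report
    jq '.duplication.rate' sample.fastp.json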

19 months ago

Note that MACS2 will ignore duplicate alignments.
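
(This is callpeak's --keep-dup option, which defaults to 1, i.e. at most one alignment kept per position; a sketch with placeholder file names and genome:)

    # MACS2 discards extra duplicate alignments by default (--keep-dup 1)
    macs2 callpeak -t sample.filtered.bam -c input.bam -f BAM -g hs \
        --keep-dup 1 -n sample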

Also, the word "duplicates" is somewhat of a misnomer: you could have identical reads (fastp can detect these) or identical alignments that span the same coordinates (fastp cannot detect these in paired-end data).

What you want to remove are duplicated alignments.

When we remove identical reads, we risk enriching for those PCR artifacts that happen to carry a sequencing error.


Good points. But I run samtools markdup prior to any peak calling. I will recheck, but I expect the resulting BAMs should be more or less identical whether or not I remove (not all) duplicates at the FASTQ stage; a quick check is sketched below.
If duplicate removal is faster than mapping and then marking those reads, we gain something with problematic data. If the duplication rates are quite low, then IMHO there is no need to bother with extra steps.
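
One quick way to do that recheck (a sketch; file names are placeholders) is to diff the flag statistics of the BAMs produced with and without the FASTQ-level dedup:

    # compare the two pipelines at the flag-count level
    samtools flagstat with_fastq_dedup.bam > a.txt
    samtools flagstat no_fastq_dedup.bam > b.txt
    diff a.txt b.txt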


I think duplicate removal is like read trimming: a remnant of a past era when sequencing reads were far more error prone, coverages were very low, the likelihood of natural duplicates was vanishingly small, and making anything work at all counted as success.

And of course, bioinformaticians like to fiddle with the reads; it creates that satisfying feeling of doing something and improving the data, but in the big picture it probably has no notable effect.

Today you don't need to remove adapters, since the aligner will softclip them with ease, and runaway PCR duplication is far less of a problem.

When someone has substantial PCR artifacts, or substantial adapter content, that data is very suspect and probably a loss; no amount of tinkering will help. Though of course one can still publish it, and with that add to the growing pile of irreproducible results.

That is my somewhat cynical take on quality control in general: if someone needs it to get results ... they are probably in trouble.


Plus, to add to this: when it comes to ChIP-seq data, I could never reconcile the rationale with reality.

In a ChIP-seq experiment we are covering a tiny subset of the genome. The more accurate the coverage of the binding site, the more likely it is to produce identical fragments, hence natural duplicates.

By removing duplicates we select against accurate fragmentation ... isn't that the most counterintuitive thing in the world?

By removing duplicates, we lose the ability to separate high-occupancy from low-occupancy regions. In return we hope to get protection from potential false positives ... but that should be fixed with replication and proper background data, not by giving up the ability to distinguish occupancy levels and by selecting against accurately fragmented reads.
