Question

Did you remove ChIP-seq duplicates?

3

7.2 years ago

mikysyc2016 ▴ 120

Hi, when you analyse ChIP-seq data (FASTQ files), do you remove duplicates from the data? Which command and software do you use? Thanks!

ChIP-Seq • 14k views

ADD COMMENT • link updated 7.2 years ago by i.sudbery 22k • written 7.2 years ago by mikysyc2016 ▴ 120

score 6 · Answer 1 · 2018-06-06

6

7.2 years ago

i.sudbery 22k

We always remove duplicates from ChIP-seq data. If your sequencing is paired-end, you'll want to do this in a paired-end-aware manner. Normally this is done after mapping. We use MarkDuplicates from Picard for ChIP-seq. samtools also has rmdup. We use Picard because back in the day MarkDuplicates was more intelligent than rmdup about how it detected duplicates, but I don't know if that is still true. If you are using MACS for your peak-calling, you'll want to mark duplicates rather than remove them.

ADD COMMENT • link 7.2 years ago by i.sudbery 22k

1

As per Ian, for ChIP-seq, I have also always marked PCR / optical duplicates with Picard MarkDuplicates. You can then literally eliminate them from the BAM files with SAMtools:

#Identify and mark duplicates, and index new BAM
java -jar MarkDuplicates.jar INPUT=Aligned_Sorted.bam OUTPUT=Aligned_Sorted_PCRDupes.bam ASSUME_SORTED=true METRICS_FILE=Aligned_Sorted_PCRDupes.txt VALIDATION_STRINGENCY=SILENT ;
samtools index Aligned_Sorted_PCRDupes.bam ;

#Expunge marked duplicate reads, and then index new BAM
samtools view -b -F 0x400 Aligned_Sorted_PCRDupes.bam > Aligned_Sorted_PCRDuped.bam ;
samtools index Aligned_Sorted_PCRDuped.bam ;

As always, however, each experiment is unique and has its own intricacies. It may not, therefore, always be appropriate to eliminate reads that are identified as duplicates.

ADD REPLY • link 6.8 years ago by Kevin Blighe 89k

0

How do you distinguish PCR duplicates from "biological" duplicates? You could lose 96% of your reads; that's a really harsh filter. In a whole-genome analysis, OK, you can filter out duplicates because you have a very low probability of sequencing the same read twice, but in amplicon sequencing or ChIP-seq this probability is very high.

ADD REPLY • link 7.2 years ago by Bastien Hervé 6.4k

3

Amplicon sequencing is very different from ChIP-seq. In ChIP-seq one would expect a protein to bind to thousands of locations. Also, ChIP-seq doesn't return the precise location, so the binding site could be anywhere within a fragment. For a 300bp fragment, that gives 300 different fragment positions covering a single site. Then account for the fact that fragments aren't a fixed size. Let's say your fragments are 250-300bp long: roughly 50 possible lengths at each of 300 positions gives you 15,000 possible read pairs for a single binding site. Now realise that a ChIP-seq peak probably contains more than one binding site, so you could be talking about 30,000 possible read pairs per peak. Across thousands of peaks, let's say 10,000, that gives you 300 million possible read pairs. Now note that on average only around 10% of reads for ChIP-seq experiments fall into peaks. So there would be around 3 billion possible unique read pairs in a ChIP-seq experiment for a factor with 10,000 binding clusters using 2x75bp reads with a 250-300bp fragment size.
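The back-of-envelope count in the paragraph above can be reproduced with shell arithmetic. This is only a sketch of that reasoning: every figure is the illustrative one from the comment, the variable names are made up for this example, and "more than one binding site per peak" is assumed to mean two.

```shell
# Illustrative figures from the comment above; none of these are measured values.
positions_per_site=300                    # fragment offsets that still cover one binding site
length_min=250                            # shortest fragment (bp)
length_max=300                            # longest fragment (bp)
lengths=$(( length_max - length_min ))    # roughly 50 distinct fragment lengths

pairs_per_site=$(( positions_per_site * lengths ))      # 15,000 read pairs per site
sites_per_peak=2                          # assumption: "more than one" taken as two
pairs_per_peak=$(( pairs_per_site * sites_per_peak ))   # 30,000 read pairs per peak

peaks=10000
pairs_in_peaks=$(( pairs_per_peak * peaks ))             # 300 million across all peaks

percent_in_peaks=10                       # only ~10% of reads fall into peaks
total_pairs=$(( pairs_in_peaks * 100 / percent_in_peaks ))  # ~3 billion overall

echo "possible read pairs per site:  $pairs_per_site"
echo "possible read pairs in peaks:  $pairs_in_peaks"
echo "possible read pairs overall:   $total_pairs"
```

With a space of possibilities this large, two independently generated fragments landing on exactly the same pair of coordinates should be rare, which is the basis for treating identical read pairs as PCR duplicates.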

If your ChIP-seq experiment has a 96% duplication rate then there is something wrong with your data. ENCODE guidelines for ChIP-seq recommend only using samples where more than 80% of the read pairs are unique (i.e. less than 20% duplication rate).

There are experiments where biological duplicates are more likely, and distinguishing between those and PCR duplicates is more important. For example, contrast the above with amplicon sequencing: if you sequence 1000x500bp amplicons there are probably only 1 million possible read pairs even if you fragment (and many of those fragments will be pretty unlikely due to fragmentation bias). For such experiments one must either not deduplicate or include UMIs in the experimental design.

BTW RNA-seq is a very common technique where deduplication is not appropriate.

ADD REPLY • link 7.2 years ago by i.sudbery 22k

0

BTW RNA-seq is a very common technique where deduplication is not appropriate.

Yes, and certain DNA-seq library preps.

ADD REPLY • link 7.2 years ago by Kevin Blighe 89k

0

Thanks a lot for this very helpful comment. It took me around an hour, with drawings and all, to fully understand it.

Biologically, I did not know that proteins could have so many binding sites. In my mind, a protein would bind to a dozen sites, not 10,000.

Do you have any additional information about:

Now note that on average only around 10% of reads for ChIP-seq experiments fall into peaks

I did not understand this info.

I conclude that ChIP-seq is more of a genome scan than a genome panel (DNA-seq).

Thanks again for the time

ADD REPLY • link 7.2 years ago by Bastien Hervé 6.4k

0

In only 1 situation did I observe a duplication rate that high, and it was due to the fact that the wet-lab immunologist had PCR amplified the same sample multiple times.

ADD REPLY • link 6.8 years ago by Kevin Blighe 89k

0

Maybe this is too simple, but can I just take the BAM file from the first command below and use it directly for peak calling, without running samtools index?

java -jar MarkDuplicates.jar INPUT=Aligned_Sorted.bam OUTPUT=Aligned_Sorted_PCRDupes.bam ASSUME_SORTED=true METRICS_FILE=Aligned_Sorted_PCRDupes.txt VALIDATION_STRINGENCY=SILENT ;

ADD REPLY • link 7.2 years ago by mikysyc2016 ▴ 120

0

Won't removing duplicates in short single-end ChIP-seq experiments put an effective ceiling on your coverage in enriched regions? There's only room for so many unique 75bp reads over a 200bp region.
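To put a rough number on that ceiling, here is a small sketch. It assumes single-end duplicates are defined by strand plus 5' start position, and that a read counts towards the region if it overlaps it by at least one base; both are assumptions made for illustration, and the variable names are hypothetical.

```shell
# Rough ceiling on unique single-end reads over one enriched region.
# Assumption: a duplicate is a read with the same strand and same start position.
read_len=75       # read length (bp)
region_len=200    # width of the enriched region (bp)

# Start positions whose read overlaps the region by at least 1 bp
starts_per_strand=$(( region_len + read_len - 1 ))   # 274

# Two strands, one surviving read per (strand, start) after deduplication
max_unique_reads=$(( starts_per_strand * 2 ))        # 548

echo "at most $max_unique_reads reads survive deduplication over this region"
```

Under these assumptions, any signal beyond a few hundred reads over the region is flattened by deduplication, whereas paired-end data distinguishes fragments by both ends and so has far more distinct possibilities.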

ADD REPLY • link 6.8 years ago by eric.fournier • 0

1

Yes. Don't do short read single-end ChIP-seq.

ADD REPLY • link 6.8 years ago by i.sudbery 22k

score 1 · Answer 2 · 2018-06-06

As suggested in this post, ~~you expect to have duplicates in ChIP-seq data because you sequenced a very small part of the genome.~~ It all depends on your coverage.

Try to find the proportion of duplicates you have. If you have 98% duplicates, try the following:

A good way to catch PCR duplicates comes from @harold.smith.tarheel's answer in the post above: "You can discriminate via genome browser of your non-deduplicated data. Bona fide peaks will have multiple overlapping reads with offsets, while samples with only PCR duplicates will stack up perfectly without offsets."

If you see "samples with only PCR duplicates will stack up perfectly without offsets", that is a problem (or at least you will have to decide whether to keep the duplicates or not). On the other hand, if you see "multiple overlapping reads with offsets", you can keep the duplicates.
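One possible way to get the proportion of duplicates mentioned above, sketched here under the assumption that duplicates have already been flagged in the BAM (for example by Picard MarkDuplicates) and that samtools is installed. The helper name `dup_percent` and the file name `marked.bam` are hypothetical.

```shell
# Hypothetical helper: percentage of duplicate-flagged reads, given two counts.
dup_percent() {
  # $1 = number of reads flagged as duplicates, $2 = total number of reads
  awk -v d="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "NA" } else { printf "%.1f\n", 100 * d / t }
  }'
}

# Example with made-up counts: 2 million duplicates out of 10 million reads
dup_percent 2000000 10000000

# With a real duplicate-marked BAM, the counts could come from samtools, e.g.:
#   total=$(samtools view -c -F 0x904 marked.bam)           # mapped primary alignments
#   dups=$(samtools view -c -f 0x400 -F 0x904 marked.bam)   # ...that carry the duplicate flag
#   dup_percent "$dups" "$total"
```

`samtools flagstat` and the Picard metrics file report similar figures; the point here is only to have a number to compare against before deciding whether to inspect the pile-ups in a genome browser.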