Hi, when you analyze ChIP-seq data (FASTQ files), do you remove duplicates from the data? Which command and software do you use? Thanks!
We always remove duplicates from ChIP-seq data. If your sequencing is paired-end, you'll want to do this in a paired-end-aware manner. Normally this is done after mapping. We use MarkDuplicates from picard for ChIP-seq. samtools also has rmdup. We use picard because, back in the day, MarkDuplicates was more intelligent than rmdup about how it detected duplicates, but I don't know if that is still true. If you are using MACS for your peak-calling, you'll want to mark duplicates rather than remove them.
As suggested in this post, you should expect some duplicates in ChIP-seq data because you sequenced only a very small part of the genome. It all depends on your coverage.
Try to find the proportion of duplicates you have. If you get 98% duplicates, try the following:
A good way to catch PCR duplicates comes from @harold.smith.tarheel's answer in the post above: "You can discriminate via genome browser of your non-deduplicated data. Bona fide peaks will have multiple overlapping reads with offsets, while samples with only PCR duplicates will stack up perfectly without offsets."
If you see "samples with only PCR duplicates will stack up perfectly without offsets", that is a problem (or at least you will have to decide whether to keep the duplicates). On the other hand, if you see "multiple overlapping reads with offsets", you can keep the duplicates.
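As a rough numerical companion to the browser check quoted above, the sketch below counts how many distinct start positions the reads in one region use. The function name, the summary fields, and the coordinates are all made up for illustration; it is not a substitute for looking at the data, and it does not define any threshold.

```python
from collections import Counter


def offset_summary(read_starts):
    """Summarise how reads in one region are spread over start positions."""
    counts = Counter(read_starts)
    total = len(read_starts)
    distinct = len(counts)
    return {
        "total_reads": total,
        "distinct_starts": distinct,
        # 1.0 means every read has its own offset; values near 0 mean stacking.
        "distinct_fraction": distinct / total if total else 0.0,
        "largest_stack": max(counts.values()) if counts else 0,
    }


# Hypothetical start coordinates, invented for this example.
staggered = [100, 104, 111, 118, 125, 131, 140, 152]  # reads with offsets
stacked = [100] * 7 + [131]                           # reads piled on one start

print(offset_summary(staggered))
print(offset_summary(stacked))
```

A region whose reads mostly share one start position looks like the "stack up perfectly" case; a region with many distinct starts looks like the "offsets" case.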
As per Ian, for ChIP-seq, I have also always marked PCR / optical duplicates with Picard MarkDuplicates. You can then literally eliminate them from the BAM files with SAMtools.
As always, however, each experiment is unique and has its own intricacies. It may not, therefore, always be appropriate to eliminate reads that are identified as duplicates.
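The command from the post above did not survive, so what follows is only a sketch of the idea, not the poster's command. Marked duplicates carry SAM flag bit 0x400 (1024), and removing them means dropping every record with that bit set. With SAMtools this kind of filter is commonly written as samtools view -b -F 1024. The Python below does the same thing on SAM text lines; the function name and the example records are hypothetical.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit 1024: PCR or optical duplicate


def drop_marked_duplicates(sam_lines):
    """Yield header lines and alignment records whose duplicate bit is unset."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not flag & DUPLICATE_FLAG:
            yield line


# Toy records invented for illustration; read2 is marked as a duplicate.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t60\t75M\t*\t0\t0\t*\t*",
    "read2\t1024\tchr1\t100\t60\t75M\t*\t0\t0\t*\t*",
    "read3\t16\tchr1\t130\t60\t75M\t*\t0\t0\t*\t*",
]

kept = list(drop_marked_duplicates(example))
print(len(kept))
```

Because the duplicates are only flagged, the marked file can still be kept alongside the filtered one if a later step needs the full set of reads.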
How do you distinguish PCR duplicates from "biological" duplicates? You could lose 96% of your reads; that's a really harsh filter. I mean, in a whole-genome analysis, OK, you can filter out duplicates because you have a very low probability of sequencing the same read twice, but in amplicon or ChIP-seq data this probability is very high.
Amplicon sequencing is very different from ChIP-seq. In ChIP-seq one would expect a protein to bind to thousands of locations. Also, ChIP-seq doesn't return the precise location, so the binding site could be anywhere within a fragment. For a 300bp fragment, that gives 300 different fragments for a single site. Then account for the fact that fragments aren't a fixed size. Let's say your fragments are 250-300bp long. That gives you 15,000 possible read pairs for a single binding site (300 positions x 50 fragment lengths). Now realise that a ChIP-seq peak probably contains more than one binding site, so you could be talking about 30,000 possible read pairs per peak across thousands of peaks. Let's say 10,000 peaks: that gives you 300 million possible read pairs for your 10,000 peaks. Now note that on average only around 10% of reads in ChIP-seq experiments fall into peaks. So there would be 3 billion possible unique read pairs in a ChIP-seq experiment for a factor with 10,000 binding clusters using 2x75bp reads with a 250-300bp fragment size.
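The back-of-the-envelope estimate above can be written out step by step. This is just the same arithmetic with the numbers from the post; the variable names are mine, and "two binding sites per peak" is an assumption read off the jump from 15,000 to 30,000.

```python
# Numbers taken from the estimate above; names are illustrative only.
positions_per_site = 300           # site can sit anywhere in a 300bp fragment
fragment_length_choices = 300 - 250  # fragment lengths between 250 and 300bp
sites_per_peak = 2                 # assumed: "more than one binding site"
peaks = 10_000
percent_reads_in_peaks = 10        # about 10% of reads fall into peaks

pairs_per_site = positions_per_site * fragment_length_choices
pairs_per_peak = pairs_per_site * sites_per_peak
pairs_in_peaks = pairs_per_peak * peaks
# Scale up so that the in-peak pairs are only 10% of the total.
total_possible_pairs = pairs_in_peaks * 100 // percent_reads_in_peaks

print(pairs_per_site)        # 15000
print(pairs_per_peak)        # 30000
print(pairs_in_peaks)        # 300000000
print(total_possible_pairs)  # 3000000000
```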
If your ChIP-seq experiment has a 96% duplication rate then there is something wrong with your data. ENCODE guidelines for ChIP-seq recommend only using samples where more than 80% of the read pairs are unique (i.e. less than 20% duplication rate).
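To make the threshold above concrete, here is a minimal sketch that turns read-pair counts into a unique fraction and compares it with the 80% figure quoted from the guidelines. The function names and the counts are hypothetical; in practice the counts would come from whatever duplicate-marking tool you ran.

```python
def unique_fraction(total_pairs, duplicate_pairs):
    """Fraction of read pairs that are not flagged as duplicates."""
    return (total_pairs - duplicate_pairs) / total_pairs


def passes_unique_threshold(total_pairs, duplicate_pairs, min_unique=0.80):
    """True when more than min_unique of the read pairs are unique."""
    return unique_fraction(total_pairs, duplicate_pairs) > min_unique


# Hypothetical counts: a 96% duplication rate versus a 10% duplication rate.
print(passes_unique_threshold(1_000_000, 960_000))  # False
print(passes_unique_threshold(1_000_000, 100_000))  # True
```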
There are experiments where biological duplicates are more likely and distinguishing between those and PCR duplicates is more important. For example, contrast the above with amplicon sequencing: if you sequence 1000 x 500bp amplicons, there are probably only 1 million possible read pairs even if you fragment (and many of those fragments will be pretty unlikely due to fragmentation bias). For such experiments one must either not deduplicate or include UMIs in the experimental design.
BTW RNA-seq is a very common technique where deduplication is not appropriate.
Yes, and certain DNA-seq library preps.
Thanks a lot for this very helpful comment. It took me around an hour to fully understand the content, with drawings and all.
Biologically, I did not know that proteins could have so many binding sites. In my mind, a protein would bind to a dozen sites, not 10,000.
Do you have additional information about:
I did not understand this information.
I conclude that ChIP-seq is more of a genome scan than a genome panel (DNA-seq).
Thanks again for the time
In only 1 situation did I observe a duplication rate that high, and it was due to the fact that the wet-lab immunologist had PCR amplified the same sample multiple times.
Maybe this is too simple a question. Can I just use the BAM file from the command below for peak calling, without running samtools index?
$ java -jar MarkDuplicates.jar INPUT=Aligned_Sorted.bam OUTPUT=Aligned_Sorted_PCRDupes.bam ASSUME_SORTED=true METRICS_FILE=Aligned_Sorted_PCRDupes.txt VALIDATION_STRINGENCY=SILENT ;
Won't removing duplicates in short single-end ChIP-seq experiments put an effective ceiling on your coverage in enriched regions? There's only room for so many unique 75-bp reads over a 200bp region.
Yes. Don't do short read single-end ChIP-seq.
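The ceiling raised in the question above can be put into numbers. The sketch assumes that single-end duplicates are defined as reads sharing the same strand and the same 5' position, which is how single-end deduplication usually works but is an assumption here; the function names are hypothetical.

```python
def max_unique_depth(read_length, strands=2):
    """Upper bound on depth at one base after single-end deduplication.

    A base can only be covered by reads whose 5' end lies within
    read_length positions of it, and only one read per position and
    strand survives deduplication.
    """
    return read_length * strands


def max_unique_reads_within(region_length, read_length, strands=2):
    """Upper bound on deduplicated reads lying entirely inside a region."""
    starts = max(region_length - read_length + 1, 0)
    return starts * strands


print(max_unique_depth(75))              # 150
print(max_unique_reads_within(200, 75))  # 252
```

With paired-end data the fragment length adds a second dimension to what counts as a duplicate, which is why the ceiling is much less of a concern there.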