Common advice in DNA-seq experiments is to remove duplicate reads. These are presumed to be optical or PCR duplicates. However, when samples are sequenced deeply (more than 10X), it becomes entirely expected that independent fragments will share the same start position and therefore look like duplicates. If we stick with the idea of throwing away duplicates, it effectively limits the sequencing depth to 1X.
In cases where there is a dramatic shift toward higher GC relative to the input and a strongly skewed distribution, one clearly feels inclined to remove the duplicates.
However, in many of the deep-sequencing datasets I work with, I see very little shift toward higher GC: the histograms of GC content are very nearly symmetric and quite close to the input.
In these cases, I often feel that I should leave the duplicated reads in. On the other hand, for certain regions of the genome I see huge numbers of tags, leading to overlapping-tag counts in the tens of thousands. These do not seem to represent genuine biology.
What solutions are there for those of us who would like to use deep sequencing but want a principled way to filter out some of these clear artifacts?
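To make the pile-ups concrete, here is a minimal sketch of how I currently flag the worst of them, by counting reads that share a start position and strand. It is Python with pysam, and the BAM name and the cutoff are placeholders rather than anything from an established pipeline:

```python
# Rough sketch: count reads sharing (chrom, start, strand) and report
# positions whose pile-up exceeds a cutoff. Assumes pysam and a BAM file;
# "deep.bam" and MAX_STARTS are placeholders.
from collections import Counter

import pysam

MAX_STARTS = 10_000  # arbitrary cutoff for "clearly artifactual" pile-ups

start_counts = Counter()
with pysam.AlignmentFile("deep.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        # key on (chrom, start, strand), roughly the way duplicate marking does
        key = (read.reference_name, read.reference_start, read.is_reverse)
        start_counts[key] += 1

suspect = {k: n for k, n in start_counts.items() if n > MAX_STARTS}
for (chrom, pos, rev), n in sorted(suspect.items(), key=lambda kv: -kv[1]):
    print(f"{chrom}\t{pos}\t{'-' if rev else '+'}\t{n}")
```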
Read this post. You have misunderstood the purpose of duplicate removal. As to GC bias, you can barely detect the bias by comparing the GC content of the genome with that of the reads. For Illumina, the typical GC bias manifests as significantly lower coverage at extremely high or low GC.
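To show what I mean by that comparison, here is a minimal sketch that bins the genome into windows, computes window GC, and reports mean coverage per GC bin. It assumes pysam plus an indexed reference and BAM; the file names, window size, and N filter are illustrative only:

```python
# Sketch: mean coverage as a function of window GC, rather than a histogram
# of read GC. Assumes pysam, an indexed FASTA ("ref.fa") and BAM ("sample.bam").
from collections import defaultdict

import pysam

WINDOW = 1_000  # placeholder window size

cov_by_gc = defaultdict(list)
with pysam.FastaFile("ref.fa") as fa, pysam.AlignmentFile("sample.bam", "rb") as bam:
    for chrom, length in zip(fa.references, fa.lengths):
        for start in range(0, length - WINDOW, WINDOW):
            seq = fa.fetch(chrom, start, start + WINDOW).upper()
            if seq.count("N") > WINDOW // 10:
                continue  # skip gappy windows
            gc = round(100.0 * (seq.count("G") + seq.count("C")) / WINDOW)
            depth = bam.count(chrom, start, start + WINDOW) / WINDOW  # reads per bp
            cov_by_gc[gc].append(depth)

for gc in sorted(cov_by_gc):
    mean_cov = sum(cov_by_gc[gc]) / len(cov_by_gc[gc])
    print(f"GC={gc}%\tmean_reads_per_bp={mean_cov:.3f}\twindows={len(cov_by_gc[gc])}")
```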
@lh3: Sorry to be a bother. Can you specify which aspect I am misunderstanding?
Removing duplicates has nothing to do with GC content. Read that post first. EDIT: don't stop after the first few posts; finish the whole thread. There is quite a lot of information mentioned by others and myself.
One of the early posts says: "The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction." The first thing I say is: "Common advice in DNA-seq experiments is to remove duplicate reads. These are presumed to be optical or PCR duplicates."
@lh3: As to your comment about GC content: PCR duplicates cause a change in the distribution of GC content of tags. Look at this paper: http://nar.oxfordjournals.org/content/early/2012/02/08/nar.gks001.long ... "This empirical evidence strengthens the hypothesis that PCR is the most important cause of the GC bias". We can quibble about whether it is the most important cause, but it seems reasonable to consider it a contributor to the distribution of GC.
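For what it's worth, the concrete check I have in mind is simply comparing the GC distribution of reads flagged as duplicates against the rest of the library. A rough sketch in Python with pysam, where the duplicate-marked BAM name and the 5% bin width are placeholders:

```python
# Sketch: GC histogram of duplicate-flagged reads vs. the rest.
# Assumes pysam and a duplicate-marked BAM ("marked.bam").
from collections import Counter

import pysam

gc_dup, gc_uniq = Counter(), Counter()
with pysam.AlignmentFile("marked.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or read.query_sequence is None:
            continue
        seq = read.query_sequence.upper()
        gc = round(100.0 * (seq.count("G") + seq.count("C")) / len(seq))
        (gc_dup if read.is_duplicate else gc_uniq)[gc] += 1

for lo in range(0, 100, 5):  # 5% bins
    hi = lo + 5 if lo < 95 else 101  # fold GC == 100% into the top bin
    d = sum(gc_dup[g] for g in range(lo, hi))
    u = sum(gc_uniq[g] for g in range(lo, hi))
    print(f"GC {lo}-{hi - 1}%\tduplicates={d}\tunique={u}")
```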
Also, I do get that there are other reasons in play, like increasing the sensitivity and specificity of peak calls ... at least that's what I take from this paper: http://www.nature.com/nmeth/journal/v9/n6/full/nmeth.1985.html
@lh3: Also, I just wanted to say I have actually read that thread before. Rereading it now, I'm thinking I probably learned about optical duplicates from you!