I have been wondering how to handle PCR duplicate reads in single-end RNA-Seq or ChIP-Seq data sets. For single-end data, it is not advisable to remove duplicates just by looking at the start position of reads. Many posts/blogs I have read suggest accounting for the duplicate reads while counting instead of removing them. Has anybody done this before, i.e. accounted for duplicate reads while counting at the gene/exon level? Are there any tools for it?
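To make it concrete, this rough pysam sketch is the kind of thing I mean (the gene coordinates are invented for illustration, and I am not claiming any existing tool works this way): count each distinct (strand, 5' position) once per gene instead of counting every read.

```python
import pysam

# hypothetical gene -> (chrom, start, end); real code would read these from a GTF
genes = {"geneA": ("chr1", 10000, 20000)}

def count_unique_starts(bam_path, genes):
    """Count reads per gene, counting each (strand, 5' position) only once."""
    counts = {}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for gene, (chrom, start, end) in genes.items():
            seen = set()
            for read in bam.fetch(chrom, start, end):
                if read.is_unmapped or read.is_secondary or read.is_supplementary:
                    continue
                # 5' sequencing end: leftmost coordinate for forward reads,
                # rightmost coordinate for reverse reads
                pos = read.reference_end if read.is_reverse else read.reference_start
                seen.add((read.is_reverse, pos))
            # duplicates collapse onto the same key instead of inflating the count
            counts[gene] = len(seen)
    return counts
```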
One more thing I am wondering about: if we pre-filter the data, some reads will be trimmed at the start or the end because of poor-quality bases. These reads will then not share start/end positions when mapped to the genome, so tools like Picard MarkDuplicates will not recognise them as duplicates even if they are PCR duplicates, since they have different start and end coordinates (in fact, different CIGAR strings, because the read lengths differ). How is everyone handling this? Is simply assuming that PCR duplicates will not have a significant effect on the end results one way to go?
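The one workaround I can think of (a minimal, untested pysam sketch, not a real answer) is to key duplicates on the unclipped 5' coordinate, which as far as I know is what Picard uses internally: keying on the 5' sequencing end makes the key robust to 3' quality trimming, and adding the clipped bases back rescues reads the aligner soft/hard-clipped at the 5' end. It still will not catch reads whose trimming happened before alignment and so left no clips, which is exactly my worry:

```python
import pysam

SOFT, HARD = 4, 5  # CIGAR operation codes for soft and hard clips

def leading_clip(cig):
    """Total soft/hard clip length at the start of a CIGAR tuple list."""
    n = 0
    for op, length in cig:
        if op in (SOFT, HARD):
            n += length
        else:
            break
    return n

def unclipped_5prime(read):
    """Unclipped 5' coordinate: where the read's 5' end would sit had the
    aligner not clipped it."""
    cig = read.cigartuples
    if read.is_reverse:
        # the 5' sequencing end of a reverse read is its rightmost coordinate
        return read.reference_end + leading_clip(list(reversed(cig)))
    return read.reference_start - leading_clip(cig)

def count_duplicates(bam_path):
    """Count reads sharing a (chromosome, strand, unclipped 5') key."""
    seen, dups = set(), 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch():
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            key = (read.reference_id, read.is_reverse, unclipped_5prime(read))
            if key in seen:
                dups += 1
            else:
                seen.add(key)
    return dups
```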
It can depend quite a lot on the antibody you use. Sharp peaks have a limited number of reads that could possibly sit under them, so reaching saturation is more likely, and deleting duplicates there would be a bad thing. For broader marks, however, I would just delete them, because you'll never hit saturation (at least not anywhere meaningful).
I imagine you could get a good estimate of the duplication rate by looking at reads in low-coverage regions of the genome: say, take all reads in regions that, when piled up, have fewer than X reads' worth of signal, where X is 2^c and c is the number of PCR cycles used to make the libraries, and use those reads to estimate the duplication frequency; a sketch of what I mean follows below. I know deepTools can correct for GC bias (something you probably want to do anyway), so maybe it can also take a static value into account when correcting biases; I don't know. Definitely do your duplicate marking before trimming, for all the reasons you suggest. Maybe do it again afterwards too, since you might detect "new" duplicates after trimming.
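Something like this back-of-the-envelope pysam sketch is what I have in mind; the bin size is an arbitrary placeholder, and for simplicity it keys duplicates on the plain (strand, start) pair rather than unclipped positions:

```python
import pysam
from collections import Counter, defaultdict

def duplication_rate_in_sparse_bins(bam_path, bin_size=1000, cycles=12):
    """Estimate the duplication rate using only sparsely covered bins, on the
    grounds that bins with fewer than 2**cycles reads cannot be saturated."""
    x = 2 ** cycles  # theoretical ceiling on copies of one fragment after PCR
    bins = defaultdict(list)  # (chrom, bin index) -> list of (strand, start)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch():
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            key = (read.reference_name, read.reference_start // bin_size)
            bins[key].append((read.is_reverse, read.reference_start))
    total = dups = 0
    for reads in bins.values():
        if len(reads) >= x:  # skip bins that may have hit saturation
            continue
        counts = Counter(reads)
        total += len(reads)
        dups += sum(c - 1 for c in counts.values())  # extra copies per position
    return dups / total if total else 0.0
```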