Question

Handling PCR duplicate reads?

9.4 years ago

GouthamAtla 12k

I have been wondering how to handle PCR duplicate reads in single-end RNA-Seq or ChIP-Seq data sets. For single-end data, it is not advisable to remove duplicates just by looking at the start position of reads. Many posts and blogs I have read suggest accounting for duplicate reads while counting the data instead of removing them. Has anybody done this before, i.e. accounted for duplicate reads while counting at the gene/exon level? Are there any tools for it?
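To make "accounting for duplicates while counting" concrete, here is a minimal sketch of one possible reading of the idea: report, per gene, both the total read count and the number of distinct (chromosome, 5' position, strand) combinations, so the duplicate-heavy genes can be spotted without discarding anything. The function name and the input tuple layout are hypothetical, and this is not how any particular counting tool works.

```python
from collections import defaultdict

def count_with_duplicates(alignments):
    """Count reads per gene, keeping duplicates visible instead of removing them.

    alignments: iterable of (gene, chrom, five_prime_pos, strand) tuples
    (a hypothetical, already-annotated input format).
    Returns {gene: (total_reads, distinct_start_sites)}.
    """
    total = defaultdict(int)
    distinct = defaultdict(set)
    for gene, chrom, pos, strand in alignments:
        total[gene] += 1
        distinct[gene].add((chrom, pos, strand))
    return {gene: (total[gene], len(distinct[gene])) for gene in total}

reads = [
    ("geneA", "chr1", 100, "+"),
    ("geneA", "chr1", 100, "+"),  # same start and strand: possible duplicate
    ("geneA", "chr1", 150, "+"),
    ("geneB", "chr2", 500, "-"),
]
counts = count_with_duplicates(reads)
```

A large gap between the two numbers for a gene only flags candidate duplication; in single-end data, highly expressed genes will share start positions by chance.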

One more thing I am wondering about: if we pre-filter the data, some reads will be trimmed at the beginning or the end because of poor-quality bases. These reads will not have the same start/end positions when mapped to the genome. Tools like Picard MarkDuplicates would not recognise them as duplicates, even if they are PCR duplicates, because they have different start and end coordinates (in fact, a different CIGAR string, due to the difference in read length). How is everyone handling this? Is assuming that PCR duplicates will not have a significant effect on the end results one way to go?
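A small sketch of how trimming interacts with a position-based duplicate key. It assumes the duplicate marker keys single-end reads on strand plus the unclipped 5' position; that is an assumption for illustration and is not confirmed anywhere in this thread. The helper `five_prime_key` is hypothetical and only parses standard SAM CIGAR operations.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def five_prime_key(chrom, pos, cigar, reverse):
    """Return (chrom, strand, unclipped 5' position) for one alignment.

    pos is the 1-based leftmost aligned base, as in the SAM POS field.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not reverse:
        # Forward strand: the 5' end is on the left, so undo leading clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return (chrom, "+", pos - lead)
    # Reverse strand: the 5' end is the rightmost base, so add the
    # reference-consuming operations and any trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return (chrom, "-", pos + ref_len - 1 + trail)

full = five_prime_key("chr1", 1000, "50M", reverse=False)
trimmed_3p = five_prime_key("chr1", 1000, "40M", reverse=False)    # 10 bases cut from the 3' end
trimmed_5p = five_prime_key("chr1", 1005, "45M", reverse=False)    # 5 bases cut from the 5' end
clipped_5p = five_prime_key("chr1", 1005, "5S45M", reverse=False)  # 5 bases soft-clipped instead
```

Under that assumption, 3' trimming leaves the key unchanged, while hard-trimming bases off the 5' end shifts it, which is the case where a true PCR duplicate would be missed. Soft clipping by the aligner, as opposed to trimming before alignment, keeps the key recoverable.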

RNA-Seq ChIP-Seq • 4.6k views

updated 2.7 years ago by Ram 45k • written 9.4 years ago by GouthamAtla 12k

It can depend quite a lot on the antibody you use. Sharp peaks have a limited number of distinct reads that could possibly fall under them, so reaching saturation is more likely. Deleting duplicates here would be a bad thing. On broader marks, however, I would just delete them, because you'll never hit saturation (at least not in a meaningful place).

I imagine you could get a good estimate of the duplication rate by looking at reads in regions with little coverage across the genome. Like, I don't know, taking all reads in regions that, when piled up, have fewer than X reads' worth of signal (where X is 2^c and c is the number of PCR cycles run to make the libraries). Use these reads to determine the duplication frequency. I know deepTools can correct for GC bias (something you probably want to do anyway), so maybe it can also be given a static value when correcting biases. I don't know. Definitely do your filtering before trimming, for all the reasons you suggested. Maybe do it afterwards too, since you might detect "new" duplicates after trimming.
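A rough sketch of the low-coverage estimate described above, under some assumptions of my own: regions are fixed-size bins (`bin_size` is a hypothetical parameter), a read is represented as (chrom, 5' position, strand), and two reads with the same tuple are treated as duplicates. Only bins holding fewer than 2^c reads contribute to the estimate.

```python
from collections import defaultdict

def low_coverage_duplication_rate(reads, pcr_cycles, bin_size=1000):
    """Estimate the duplication rate using only low-coverage bins.

    reads: iterable of (chrom, five_prime_pos, strand) tuples.
    A bin is "low coverage" if it holds fewer than 2**pcr_cycles reads.
    Returns the duplicated fraction, or None if no bin qualifies.
    """
    threshold = 2 ** pcr_cycles
    bins = defaultdict(list)
    for chrom, pos, strand in reads:
        bins[(chrom, pos // bin_size)].append((chrom, pos, strand))

    total = 0
    distinct = 0
    for members in bins.values():
        if len(members) < threshold:
            total += len(members)
            distinct += len(set(members))
    if total == 0:
        return None
    return 1 - distinct / total

reads = (
    [("chr1", 10, "+"), ("chr1", 10, "+"), ("chr1", 20, "+")]  # sparse bin, one repeat
    + [("chr1", 5000, "+")] * 5                                 # dense bin, excluded below
)
rate = low_coverage_duplication_rate(reads, pcr_cycles=2)  # threshold of 4 reads, toy value
```

With a realistic number of PCR cycles the threshold gets large, so the bin size and cutoff would need tuning against real data before trusting the number.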

updated 5.4 years ago by Ram 45k • written 9.4 years ago by John 13k

Ram · Answer 1 · 2015-12-02

My thoughts on this:

For RNA-Seq I keep everything, duplicates or not, SE or PE.
For ChIP-Seq and similar enrichment experiments (FAIRE, ATAC etc.) I remove duplicates. In theory you do expect duplication, since you sequence a smallish proportion of the genome quite deeply. In practice I get a better signal-to-noise ratio without duplicates, at least up to ~100M reads per library (mammalian genome). This is based on visual inspection, nothing sophisticated.
About marking single-end reads, there was a discussion started by me here: Mark duplicates for single end reads: Why only 5'end?