As I understand it, targeted sequencing produces an enriched library. However, it is not clear to me whether the duplicate sequence plot from FastQC is actually biased for this type of sequencing. I am aware that duplication is expected in RNA-seq libraries, where it reflects highly expressed genes. If we are sequencing a specific set of regions and amplifying them several times, should I expect the same pattern? And should most of the sequences then fall into the bins with more than one copy?
I found an explanation which says: "In both the raw and deduplicated versions of the library the vast majority of reads come from sequences which only occur once within the library- this will be true for Whole-Genome sequencing for example, suggesting that there is a diverse population". For targeted capture sequencing, however, the library is not diverse, so should we expect high duplication levels?
Thanks.
Duplicates are normal and expected in targeted experiments. I do not often perform targeted sequencing myself, but I hear experienced people say to leave duplicates untouched, because the increase in false negatives after removing duplicates is not justified by the reduction in false positives.
In my experience with RRBS, which is also an enriched type of sequencing (in this case with bisulfite), high levels of duplication are usually observed and not removed. By "high levels" I mean, for example, that the sequence duplication levels plot in FastQC often reports a "percent of seqs remaining if deduplicated" of less than 20-30%.
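To make that number more concrete, here is a minimal Python sketch of how a "percent remaining if deduplicated" figure can be computed from exact sequence matches. This is only an illustration: the function names and the toy reads are made up for this example, and FastQC itself estimates the value from a subset of the file, so its reported figure will not necessarily match a full count like this one.

```python
from collections import Counter


def percent_remaining_if_deduplicated(reads):
    """Percentage of reads left if every exact duplicate is collapsed to one copy."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # One read survives per distinct sequence.
    return 100.0 * len(counts) / total


def duplication_level_histogram(reads):
    """Map each duplication level (copies per sequence) to the % of all reads at that level."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return {}
    reads_per_level = Counter()
    for copies in counts.values():
        # A sequence seen `copies` times contributes `copies` reads to that level.
        reads_per_level[copies] += copies
    return {level: 100.0 * n / total for level, n in sorted(reads_per_level.items())}


# Toy example (hypothetical reads): one amplified target dominates the library.
reads = ["ACGT"] * 8 + ["TTGA"] + ["GGCC"]
print(percent_remaining_if_deduplicated(reads))
print(duplication_level_histogram(reads))
```

In this toy library, 8 of the 10 reads come from a single sequence, so most reads sit in a bin with more than one copy and only a small fraction would remain after deduplication. That is the pattern described above for enriched libraries.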