Hi all,
I've previously called variants from WGS data but am now tasked with using single-end ddRAD-seq data. It seems STACKS is popular for this purpose, but I'm confused by how it handles the high levels of read duplication that seem inherent to RAD-seq.
The only reference to duplicates in the manual here https://catchenlab.life.illinois.edu/stacks/manual/ is a code snippet that says:
We have instructed gstacks to remove any PCR duplicates that it finds.
The relevant option is:
gstack ..... --rm-pcr-duplicates
I may have a fundamental misunderstanding of RAD-seq, but if reads must always start at a restriction site, wouldn't any 2-fold or greater coverage of a given loci rely on there being biological duplicate reads? In other words, how can you get decent read depth at a loci without having biological duplicate reads?
If this is the case, wouldn't removing apparent 'PCR duplicates' also remove all real biological duplicates that would support genotype calls?
Hi Andres,
Thanks for your response, it's really nice to know that someone else has had this issue and thought a bit about the problem. I was starting to think I was being a bit crazy for worrying about this.
Making the situation worse, the data I am working with is single-end data, so I cannot even use varying reverse reads and insert lengths to distinguish biological duplicates from PCR dups... I think I may have to just omit that argument from STACKS as you did and accept this as a limitation of the study.
Yes I agree it can distort allele frequencies, that's a definitely a concern. Fortunately, my species is a self-fertilising crop so it is highly homozygous. Hopefully I therefore won't accidentally lose too many heterozygous alleles during variant calling... A bit frustrating though! Could you explain a little why a strict variant calling QC may help?
Many thanks, Max