Question

STACKS PCR duplicate removal

0

Entering edit mode

14 months ago

maxrwjones ▴ 60

Hi all,

I've previously called variants from WGS data but am now tasked with using single-end ddRAD-seq data. It seems STACKS is popular for this purpose, but I'm confused by how it handles the high levels of read duplication that seem inherent to RAD-seq.

The only reference to duplicates in the manual here https://catchenlab.life.illinois.edu/stacks/manual/ is a code snippet that says:

We have instructed gstacks to remove any PCR duplicates that it finds.

The relevant option is:

gstack ..... --rm-pcr-duplicates

I may have a fundamental misunderstanding of RAD-seq, but if reads must always start at a restriction site, wouldn't any 2-fold or greater coverage of a given loci rely on there being biological duplicate reads? In other words, how can you get decent read depth at a loci without having biological duplicate reads?

If this is the case, wouldn't removing apparent 'PCR duplicates' also remove all real biological duplicates that would support genotype calls?

PCR VCF STACKS RAD-seq • 971 views

ADD COMMENT • link 14 months ago by maxrwjones ▴ 60

score 1 · Answer 1 · 2024-04-03

1

Entering edit mode

14 months ago

Andres ▴ 20

I had the same problem before. I don't remember well, but i ignored that option for ddRAD-seq data. For single-cut RAD-seq data is not that big of a deal to remove identical reads, as the reverse read of a loci may fall randomly around the insert length of the read. But with ddRAD-seq the insert length is going to be more or less static, you would get tons of identical reads.

I think is not possible to identify PCR duplicates in this case, is an error to take into account, as it can affect allele frequencies later on. ¿Maybe a strict variant calling QC may help?. Don't know.

ADD COMMENT • link 14 months ago by Andres ▴ 20

0

Entering edit mode

Hi Andres,

Thanks for your response, it's really nice to know that someone else has had this issue and thought a bit about the problem. I was starting to think I was being a bit crazy for worrying about this.

Making the situation worse, the data I am working with is single-end data, so I cannot even use varying reverse reads and insert lengths to distinguish biological duplicates from PCR dups... I think I may have to just omit that argument from STACKS as you did and accept this as a limitation of the study.

Yes I agree it can distort allele frequencies, that's a definitely a concern. Fortunately, my species is a self-fertilising crop so it is highly homozygous. Hopefully I therefore won't accidentally lose too many heterozygous alleles during variant calling... A bit frustrating though! Could you explain a little why a strict variant calling QC may help?

Many thanks, Max

ADD REPLY • link 14 months ago by maxrwjones ▴ 60