Hi,
I'm trying to call variants from amplicon-based sequencing reads. These reads have primer adapters and barcodes on both ends. I'm looking into the GATK pipeline and the Samtools/VarScan pipeline.
I was able to remove the primer sequences on both sides using cutadapt.
Next, I aligned my reads using BWA-MEM. Then I removed duplicate reads (to remove PCR duplicates) using samtools markdup. However, aligning removed the barcodes on both ends, and deduplicating removed most of my reads. I'm looking into Picard's MarkDuplicates, but that also does not seem applicable to amplicon-based reads, because it is based on the start position of the reads and would delete a majority of my reads.
Is there any way to remove identical sequences for amplicon-based reads? Furthermore, I want the barcode identifiers to remain after aligning. How would I do that?
Thank you!
Do not remove duplicates with amplicon data; your reads are, by definition, all duplicates. Aligning with BWA should not remove the barcodes: they should have been soft-clipped, but they should still be present in the read sequence.
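To check that soft-clipped barcodes are still in the alignment records, something like the following rough sketch may help. It assumes you already have the CIGAR and SEQ fields of a SAM record as strings; the function name `soft_clipped_ends` and the example read are made up for illustration, not taken from your data.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_ends(cigar, seq):
    """Return the (left, right) soft-clipped bases of a SAM record."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are not stored in SEQ, so ignore them.
    while ops and ops[0][1] == "H":
        ops.pop(0)
    while ops and ops[-1][1] == "H":
        ops.pop()
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return seq[:left], seq[len(seq) - right:]

# Hypothetical read: 8 bp barcode, 20 aligned bases, 8 bp barcode.
seq = "ACGTACGT" + "TTGACCATGGCATTAGCCTA" + "TGCATGCA"
left, right = soft_clipped_ends("8S20M8S", seq)
print(left, right)
```

Note that for reads mapped to the reverse strand, SEQ is stored reverse-complemented, so the clipped bases will be reverse-complemented too.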
You want to keep one correct read, and lots of reads with errors? Why remove identical reads?
I want to delete reads with the exact same sequences (including barcodes) so that I can eliminate any PCR duplicates. I want to make sure that my future variant analysis is not biased because of PCR duplicates.
You have amplicon-based reads. By definition, all reads you see are PCR duplicates.
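To see why position-based duplicate marking throws away most amplicon reads, here is a deliberately simplified sketch. It is not how samtools markdup or Picard's MarkDuplicates are implemented (the real tools use more information, such as mate positions); it only assumes that reads sharing the same reference, start position and strand are treated as duplicates of each other. The read tuples and the `position_dedup` name are hypothetical.

```python
def position_dedup(reads):
    """Keep the first read seen for each (reference, start, strand) key."""
    kept = {}
    for name, ref, start, strand in reads:
        kept.setdefault((ref, start, strand), name)
    return list(kept.values())

# Hypothetical amplicon: 1000 reads that all start at the same primer site.
reads = [(f"read{i}", "chr1", 10_000, "+") for i in range(1000)]
print(len(position_dedup(reads)))  # 1
```

Every read from one amplicon starts at the same place, so the whole amplicon collapses to a single read regardless of how many original template molecules there were.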
Hello newbinf,
Don't forget to follow up on your threads.
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.