Hello,
I have an old dataset from 2010 of paired-end Illumina 54 bp reads with a lot of PCR duplicates. These duplicate pairs are very obvious: exactly the same forward and reverse read sequences are present several times under different read names.
I know how to get rid of them using a BAM alignment, but I am interested in methods that remove them without an alignment, since I want to analyze all reads, not just those that align to the genome.
What are some available approaches that take as input fastq and output fastq?
Thank you,
Adrian
Also, PRINSEQ
This worked:
Check out FastUniq
It's a bit odd that the max is 1000 pairs.
Just for the record: FastUniq cannot account for sequencing errors (which can be a strong limitation). Here is a quote from the authors' article (Xu _et al._, 2012).
Hi, do you know of any tools with the same function written in Python?
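For what it's worth, the exact-duplicate case described in the question is simple enough to sketch in plain Python. This is only a minimal illustration, not how FastUniq or PRINSEQ are implemented: the function names `read_fastq` and `dedup_pairs` are made up for this example, it assumes strict 4-line FASTQ records with R1 and R2 in the same order, it keeps the first pair seen for each sequence combination, and it holds every unique sequence pair in memory. Like any exact-match approach, it will not catch duplicates that differ by a sequencing error.

```python
def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples.

    Assumes strict 4-line FASTQ records (no wrapped sequences).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def dedup_pairs(r1_in, r2_in, r1_out, r2_out):
    """Write the first occurrence of each (R1 sequence, R2 sequence) pair.

    Read names and qualities are ignored when deciding what is a duplicate.
    Returns the number of pairs kept.
    """
    seen = set()
    kept = 0
    for rec1, rec2 in zip(read_fastq(r1_in), read_fastq(r2_in)):
        key = (rec1[1], rec2[1])  # forward and reverse sequences together
        if key in seen:
            continue  # exact duplicate pair, skip it
        seen.add(key)
        r1_out.write("\n".join(rec1) + "\n")
        r2_out.write("\n".join(rec2) + "\n")
        kept += 1
    return kept
```

You would call it with four open file handles, e.g. `dedup_pairs(open("in_1.fastq"), open("in_2.fastq"), open("out_1.fastq", "w"), open("out_2.fastq", "w"))` (file names here are placeholders). For very large datasets, storing a hash of each sequence pair instead of the sequences themselves would reduce memory use.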