Question

Removing PCR duplicates from .fastq without .bam alignment

5

10.4 years ago

Adrian Pelin ★ 2.7k

Hello,

I have an old dataset from 2010 of paired-end (PE) Illumina 54 bp reads with a lot of PCR duplicates. The duplicate pairs are very obvious: exactly the same forward and reverse read sequences appear several times under different read names.

I know how to get rid of them using a BAM alignment/mapping, but I am interested in methods that remove them without an alignment, since I want to analyze all reads, not just those that align to the genome.

What are some available approaches that take FASTQ as input and produce FASTQ as output?

Thank you,

Adrian

PCR duplicates illumina fastq • 15k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Adrian Pelin ★ 2.7k

2

Also, PRINSEQ

ADD REPLY • link 10.4 years ago by komal.rathi ★ 4.1k

0

This worked:

perl prinseq-lite.pl -fastq ~/Encephalitozoon/Eromalae/100611_s_4_1_seq_GDR-7.fastq -fastq2 ~/Encephalitozoon/Eromalae/100611_s_4_2_seq_GDR-7.fastq -phred64 -derep 1

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Adrian Pelin ★ 2.7k

1

Check out FastUniq

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by lkmklsmn ▴ 980

0

perl prinseq-lite.pl -fastq ~/Encephalitozoon/Eromalae/100611_s_4_1_seq_GDR-7.fastq -fastq2 ~/Encephalitozoon/Eromalae/100611_s_4_2_seq_GDR-7.fastq -phred64 -derep 1

It's a bit odd that the max is 1000 pairs.

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Adrian Pelin ★ 2.7k

0

Just for the record: FastUniq cannot account for sequencing errors, which can be a strong limitation. Here is a quote from the authors' article (Xu _et al._, 2012):

There were some differences in levels of duplicates identified by FastUniq and Picard Markduplicates that were caused by the different criteria in read pair comparisons (Figure 3A, Table 1). Of them, FastUniq compares read pairs on the basis of sequences only, and it is sensitive to SNPs caused by heterozygous or sequencing errors.

ADD REPLY • link 7.6 years ago by Charles Plessy ★ 2.9k
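
To make the "sequences only" comparison in the quote above concrete, here is a minimal Python sketch of exact-match deduplication of read pairs. It is an illustration under stated assumptions, not FastUniq's or PRINSEQ's actual implementation: the function names `read_fastq` and `dedup_pairs` are made up for this example, the two FASTQ files are assumed to be uncompressed, to use four lines per record, and to list mates in the same order. A pair is dropped only when both the forward and the reverse sequence are identical to a pair already seen, so a single sequencing error is enough for a duplicate to escape, which is exactly the limitation described above.

```python
def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ stream.

    Assumption: no wrapped sequences and no blank lines between records.
    """
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:  # empty header line means end of file
            return
        yield tuple(record)


def dedup_pairs(r1_handle, r2_handle, out1_handle, out2_handle):
    """Write the first occurrence of each (forward, reverse) sequence pair.

    Returns (total_pairs, kept_pairs). Every distinct sequence pair is held
    in memory, so this is only a sketch for modest data sets.
    """
    seen = set()
    total = kept = 0
    for rec1, rec2 in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        total += 1
        key = (rec1[1], rec2[1])  # compare sequences only, ignore names and qualities
        if key in seen:
            continue
        seen.add(key)
        kept += 1
        out1_handle.write("\n".join(rec1) + "\n")
        out2_handle.write("\n".join(rec2) + "\n")
    return total, kept
```

With real files, the handles would come from something like `open("reads_1.fastq")` and `open("dedup_1.fastq", "w")` (placeholder file names). The first copy of each pair is kept regardless of its quality scores; choosing the best-quality copy instead would be a separate design decision.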

0

Hi, do you know of any tools written in Python that do the same thing?

ADD REPLY • link 8.3 years ago by kaixian110 • 0

score 7 · Answer 1 · 2017-01-18

7

8.3 years ago

Brian Bushnell 20k

Clumpify can mark or remove duplicate reads very efficiently without alignment:

clumpify.sh in=reads.fq out=deduped.fq dedupe

ADD COMMENT • link 8.3 years ago by Brian Bushnell 20k