Removing PCR duplicates from .fastq without .bam alignment
10.0 years ago
Adrian Pelin ★ 2.6k

Hello,

I have an old dataset from 2010 of paired-end (PE) Illumina 54 bp reads with a lot of PCR duplicates. The duplicate pairs are very obvious: exactly the same forward and reverse read sequences appear several times under different read names.

I know how to remove them using a BAM alignment, but I am interested in methods that do not require an alignment, since I want to analyze all reads, not just those that align to the genome.

What are some available approaches that take as input fastq and output fastq?

Thank you,

Adrian

PCR duplicates illumina fastq • 14k views

Also, PRINSEQ


This worked:

perl prinseq-lite.pl -fastq ~/Encephalitozoon/Eromalae/100611_s_4_1_seq_GDR-7.fastq -fastq2 ~/Encephalitozoon/Eromalae/100611_s_4_2_seq_GDR-7.fastq -phred64 -derep 1

Check out FastUniq


It's a bit odd that the maximum is 1000 pairs.


Just for the record: FastUniq cannot account for sequencing errors, which can be a serious limitation. Here is a quote from the authors' article (Xu _et al._, 2012):

There were some differences in levels of duplicates identified by FastUniq and Picard Markduplicates that were caused by the different criteria in read pair comparisons (Figure 3A, Table 1). Of them, FastUniq compares read pairs on the basis of sequences only, and it is sensitive to SNPs caused by heterozygous or sequencing errors.
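
To make the "sequences only" criterion concrete, here is a minimal Python sketch of exact-match deduplication for paired FASTQ files. It is not FastUniq's or PRINSEQ's actual implementation, just an illustration of the idea; the function names `read_fastq` and `dedupe_pairs` are made up for this example, and it assumes plain 4-line FASTQ records with both files listing mates in the same order.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line-per-record FASTQ handle."""
    while True:
        record = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def dedupe_pairs(fwd_in, rev_in, fwd_out, rev_out):
    """Keep only the first occurrence of each (forward, reverse) sequence pair.

    Hypothetical helper for illustration. Read names and qualities are
    ignored when comparing, so a single mismatching base makes two pairs
    count as different. Returns (pairs_read, pairs_kept).
    """
    seen = set()  # holds every distinct sequence pair in memory
    total = kept = 0
    # zip() stops at the shorter input; both files are assumed to be in sync.
    for fwd, rev in zip(read_fastq(fwd_in), read_fastq(rev_in)):
        total += 1
        key = (fwd[1], rev[1])  # the two read sequences
        if key in seen:
            continue
        seen.add(key)
        kept += 1
        fwd_out.write("\n".join(fwd) + "\n")
        rev_out.write("\n".join(rev) + "\n")
    return total, kept
```

To use it, open the two input files and two output files with `open()` and pass the four handles to `dedupe_pairs`. Because the key is the exact pair of sequences, a duplicate carrying one sequencing error is kept as a separate pair, which is the limitation described in the quote. The set of seen pairs lives in memory, so very large datasets may need a more compact key (for example a hash of the two sequences).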


Hi, do you know of any tools with the same function written in Python?

7.9 years ago

Clumpify can mark or remove duplicate reads very efficiently without alignment:

clumpify.sh in=reads.fq out=deduped.fq dedupe
