I would like to remove duplicates from a paired end ngs data set (illumina).
This is an environmental sample, so no host to map to first - so I don't think tools such as Picard and samtools will work to remove duplicates.
I have been testing prinseq - but it doesn't seem to work properly.
Paired file 1:
@Seq1/1
AACCGGTTAACCGGTTAACCGGTT
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq2/1
AACCGGTTAACCGGTTAACCGGTT
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq3/1
AACCGGTTAACCGGTTAACCGGTA
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq4/1
AACCGGTTAACCGGTTAACCGGTC
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq5/1
CCGGTTAACCGGTTAACCGGTT
+
HHHHHHHHHHHHHHHHHHHHHH
@Seq6/1
AACCGGTTAACCGGTTAACCGG
+
HHHHHHHHHHHHHHHHHHHHHH
Seq 2 is an exact replicate of Seq1, Seq3 and Seq4 are unique, Seq 5 is 3' duplicate, and Seq6 is a 5' duplicate. If you run prinseq on this file alone it seems to work great - the 3 duplciates are removed:
prinseq-lite.pl -derep 12345 -fastq test1.fastq
But when you add the 2nd paired file in (all seqs the same):
@Seq1/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq2/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq3/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq4/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq5/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq6/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
Run prinseq:
prinseq-lite.pl -derep 12345 -fastq test1.fastq -fastq2 test2.fastq
Prinseq now only removes the exact duplicate sequence (Seq1 and Seq2 are duplicates). Why is not removing Seq5 and Seq6? In my mind it should be.
Even if the paired reads are concatenated together - Seq5 should still be removed as it would be a 3' duplicate.
Anyone got any ideas? Maybe in paired end mode prinseq only deals with exact duplicates - or maybe I'm doing something wrong.
I am not going to answer your question directly but you may be able to use a work around. bbduk.sh/dedupe.sh from BBMap may be an alternate option to try. This will require a significant amount of memory depending on the size of your dataset.
Otherwise you can let prin-seq do its thing with R1. Then use repair.sh tool from BBMap to bring the R1/R2 files back in sync.
Thanks for the suggestion(s) - I was going to resort to running prinseq individually on each read file and then merge - but was just wondering if there were other tools available - shall check out BBMap