Remove Duplicate Reads from paired end data set
8.1 years ago
Tonor ▴ 480

I would like to remove duplicates from a paired-end NGS data set (Illumina).

This is an environmental sample, so there is no host genome to map to first, and I don't think mapping-based tools such as Picard and samtools will work to remove duplicates.

I have been testing prinseq, but it doesn't seem to work properly on paired files.

Paired file 1:

@Seq1/1
AACCGGTTAACCGGTTAACCGGTT
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq2/1
AACCGGTTAACCGGTTAACCGGTT
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq3/1
AACCGGTTAACCGGTTAACCGGTA
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq4/1
AACCGGTTAACCGGTTAACCGGTC
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq5/1
CCGGTTAACCGGTTAACCGGTT
+
HHHHHHHHHHHHHHHHHHHHHH
@Seq6/1
AACCGGTTAACCGGTTAACCGG
+
HHHHHHHHHHHHHHHHHHHHHH

Seq2 is an exact duplicate of Seq1, Seq3 and Seq4 are unique, Seq5 is a 3' duplicate of Seq1 (it matches the 3' end of Seq1), and Seq6 is a 5' duplicate of Seq1 (it matches the 5' end). If you run prinseq on this file alone it seems to work great - the 3 duplicates are removed:

prinseq-lite.pl -derep 12345 -fastq test1.fastq
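To make the three duplicate types concrete, here is a minimal Python sketch that applies the definitions used in this post to the six reads above. It is only an illustration, not prinseq's implementation: the function name `classify_duplicates` is made up, and reverse-complement duplicates are not handled.

```python
# Reads from paired file 1 (sequences only).
reads = {
    "Seq1": "AACCGGTTAACCGGTTAACCGGTT",
    "Seq2": "AACCGGTTAACCGGTTAACCGGTT",
    "Seq3": "AACCGGTTAACCGGTTAACCGGTA",
    "Seq4": "AACCGGTTAACCGGTTAACCGGTC",
    "Seq5": "CCGGTTAACCGGTTAACCGGTT",
    "Seq6": "AACCGGTTAACCGGTTAACCGG",
}


def classify_duplicates(reads):
    """Return {name: (kind, name_of_kept_read)} for every read flagged as a duplicate."""
    flagged = {}
    kept = []  # (name, sequence) of reads retained so far
    # Longest reads first, so shorter reads are compared against longer ones.
    # sorted() is stable, so reads of equal length keep their input order.
    for name, seq in sorted(reads.items(), key=lambda item: -len(item[1])):
        for kept_name, kept_seq in kept:
            if seq == kept_seq:
                kind = "exact"
            elif kept_seq.startswith(seq):
                kind = "5' duplicate"  # identical to the 5' end of a longer read
            elif kept_seq.endswith(seq):
                kind = "3' duplicate"  # identical to the 3' end of a longer read
            else:
                continue
            flagged[name] = (kind, kept_name)
            break
        else:
            kept.append((name, seq))
    return flagged


for name, (kind, kept_name) in classify_duplicates(reads).items():
    print(f"{name}: {kind} of {kept_name}")
```

On this test set it flags Seq2, Seq5 and Seq6 and keeps Seq1, Seq3 and Seq4, which matches the single-file result described above.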

But when you add the 2nd paired file in (all seqs the same):

@Seq1/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq2/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq3/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq4/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq5/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq6/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH

Run prinseq:

prinseq-lite.pl -derep 12345 -fastq test1.fastq -fastq2 test2.fastq

Prinseq now only removes the exact duplicate (Seq1 and Seq2 are duplicates). Why is it not removing Seq5 and Seq6? In my mind it should be.

Even if the paired reads are concatenated together - Seq5 should still be removed as it would be a 3' duplicate.
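The concatenation argument can be checked with a few lines of Python. This is a sketch under the assumption that each pair is joined simply as read 1 followed by read 2; whether prinseq actually does this internally is not something this post confirms.

```python
# Read 1 sequences for the three pairs of interest, plus the shared read 2.
r1 = {
    "Seq1": "AACCGGTTAACCGGTTAACCGGTT",
    "Seq5": "CCGGTTAACCGGTTAACCGGTT",
    "Seq6": "AACCGGTTAACCGGTTAACCGG",
}
R2 = "TTTTTTTAAAAAAATTTTTTTAAAAAA"

# Assumed joining scheme: read 1 immediately followed by read 2.
concat = {name: seq + R2 for name, seq in r1.items()}
reference = concat["Seq1"]

checks = {}
for name in ("Seq5", "Seq6"):
    checks[name] = {
        "5' duplicate": reference.startswith(concat[name]),
        "3' duplicate": reference.endswith(concat[name]),
    }
    print(name, checks[name])
```

Under that assumption the concatenated Seq5 pair is still a 3' duplicate of the concatenated Seq1 pair, while Seq6 is neither a 5' nor a 3' duplicate, because the two missing bases now sit in the middle of the joined sequence.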

Anyone got any ideas? Maybe in paired end mode prinseq only deals with exact duplicates - or maybe I'm doing something wrong.

next-gen sequence • 7.7k views

I am not going to answer your question directly, but you may be able to use a workaround. bbduk.sh/dedupe.sh from BBMap may be an alternative to try. This will require a significant amount of memory, depending on the size of your dataset.

Otherwise, you can let prinseq do its thing with R1 only, then use the repair.sh tool from BBMap to bring the R1/R2 files back in sync.
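For anyone wondering what "back in sync" means in practice, here is a rough Python sketch of the idea: keep only the R2 records whose read name still exists in the deduplicated R1. It is not how repair.sh is implemented; the helper names are invented, and it assumes four-line FASTQ records with `/1` and `/2` name suffixes and that only R1 has lost reads.

```python
def parse_fastq(text):
    """Split FASTQ text into (header, seq, plus, qual) tuples, 4 lines per record."""
    lines = text.strip().splitlines()
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def pair_key(header):
    """Read name without the leading '@' and without a trailing /1 or /2."""
    name = header[1:].split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def resync(r1_records, r2_records):
    """Keep only the R2 records whose mate is still present in R1."""
    surviving = {pair_key(rec[0]) for rec in r1_records}
    return [rec for rec in r2_records if pair_key(rec[0]) in surviving]


def make_record(name, seq):
    """Build one FASTQ record as text, with a dummy quality string of matching length."""
    return f"@{name}\n{seq}\n+\n{'H' * len(seq)}\n"


# R1 after deduplication (Seq2, Seq5 and Seq6 removed) and the untouched R2.
r1_text = "".join(
    make_record(name, seq)
    for name, seq in [
        ("Seq1/1", "AACCGGTTAACCGGTTAACCGGTT"),
        ("Seq3/1", "AACCGGTTAACCGGTTAACCGGTA"),
        ("Seq4/1", "AACCGGTTAACCGGTTAACCGGTC"),
    ]
)
r2_text = "".join(
    make_record(f"Seq{i}/2", "TTTTTTTAAAAAAATTTTTTTAAAAAA") for i in range(1, 7)
)

r2_synced = resync(parse_fastq(r1_text), parse_fastq(r2_text))
print([rec[0] for rec in r2_synced])
```

A real re-pairing tool also has to cope with reads missing from either file and usually writes the orphaned mates to a separate singletons file, which this sketch leaves out.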


Thanks for the suggestion(s) - I was going to resort to running prinseq on each read file individually and then merging the results, but was wondering if there were other tools available. I shall check out BBMap.

8.1 years ago
Rohit ★ 1.5k

If you want to remove the exact duplicate sequences, FastUniq might be a good choice.


Thanks - I hadn't heard of FastUniq, but it sounds like it does the same as prinseq does at the moment: it removes the pairs that are exact matches. I also want to remove the pairs that are fully contained within other pairs, but that might be too computationally intensive.
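To show what "fully contained within other pairs" could mean, and why it gets expensive, here is a naive Python sketch. The containment rule is an assumption made for illustration (read 1 is a substring of the other pair's read 1, and read 2 a substring of its read 2), and `remove_contained_pairs` is a made-up name, not a function of any tool mentioned in this thread.

```python
R2 = "TTTTTTTAAAAAAATTTTTTTAAAAAA"

# The six test pairs from the question: name -> (read 1, read 2).
pairs = {
    "Seq1": ("AACCGGTTAACCGGTTAACCGGTT", R2),
    "Seq2": ("AACCGGTTAACCGGTTAACCGGTT", R2),
    "Seq3": ("AACCGGTTAACCGGTTAACCGGTA", R2),
    "Seq4": ("AACCGGTTAACCGGTTAACCGGTC", R2),
    "Seq5": ("CCGGTTAACCGGTTAACCGGTT", R2),
    "Seq6": ("AACCGGTTAACCGGTTAACCGG", R2),
}


def remove_contained_pairs(pairs):
    """Return names of pairs that are not contained in an already kept pair."""
    kept = []
    # Longest pairs first, so a shorter pair is only compared with longer or equal ones.
    ordered = sorted(pairs.items(), key=lambda item: -(len(item[1][0]) + len(item[1][1])))
    for name, (read1, read2) in ordered:
        contained = any(
            read1 in pairs[other][0] and read2 in pairs[other][1] for other in kept
        )
        if not contained:
            kept.append(name)
    return kept


kept_pairs = remove_contained_pairs(pairs)
print(kept_pairs)
```

On the test data this keeps Seq1, Seq3 and Seq4. Every pair is compared against every kept pair, so the work grows roughly quadratically with the number of distinct pairs, which is the computational concern raised above; a real tool would need hashing or an index to scale.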


I think FastUniq also removes the subset of the sequence but only matching from the 5' end, not sure if there is a better alternative for the rest of them :|

8.0 years ago

Do you want to normalize metagenomic reads?

diginorm-separation

You may try khmer from Dr. C. Titus Brown.

Related docs:


Great suggestion, I shall take a look. It is related to metagenomics: I am trying to measure the diversity of taxa present within a sample based on the reads (prior to de novo assembly). If one assumes that any duplicated read is not real but a product of some sort of PCR bias, then the duplicates need to be removed.


It seems similar to removing chimeric sequences from 16S sequencing data. Did you try QIIME?


Swarm works for metagenomic samples, just in case :)


Hi - Have you tried Swarm? It says amplicon based - but would it work on shotgun metagenomics?


Yes, I have tried Swarm, but my purpose was completely different (I was reducing data sets of sequences rather than reads, for which vmatch is more suitable). Their testing was done on paired-end metagenomic samples too, so it should serve the purpose.
