Question

Remove Duplicate Reads from paired end data set

0


8.1 years ago

Tonor ▴ 480

I would like to remove duplicates from a paired-end NGS data set (Illumina).

This is an environmental sample, so there is no host genome to map to first, and I don't think mapping-based tools such as Picard and samtools will work to remove duplicates.

I have been testing prinseq - but it doesn't seem to work properly.

Paired file 1:

@Seq1/1
AACCGGTTAACCGGTTAACCGGTT
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq2/1
AACCGGTTAACCGGTTAACCGGTT
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq3/1
AACCGGTTAACCGGTTAACCGGTA
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq4/1
AACCGGTTAACCGGTTAACCGGTC
+
HHHHHHHHHHHHHHHHHHHHHHHH
@Seq5/1
CCGGTTAACCGGTTAACCGGTT
+
HHHHHHHHHHHHHHHHHHHHHH
@Seq6/1
AACCGGTTAACCGGTTAACCGG
+
HHHHHHHHHHHHHHHHHHHHHH

Seq2 is an exact replicate of Seq1, Seq3 and Seq4 are unique, Seq5 is a 3' duplicate, and Seq6 is a 5' duplicate. Running prinseq on this file alone seems to work great, and the 3 duplicates (Seq2, Seq5 and Seq6) are removed:

prinseq-lite.pl -derep 12345 -fastq test1.fastq
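To make the three duplicate categories concrete, here is a minimal Python sketch that labels the six test reads. It assumes, following the description above, that a 5' duplicate is a read identical to the start of a longer read and a 3' duplicate is a read identical to the end of a longer read. The function name `classify_duplicates` is hypothetical, and this is an illustration of the definitions rather than prinseq's actual implementation.

```python
def classify_duplicates(reads):
    """Label each (name, sequence) read as unique or as a kind of duplicate."""
    labels = {}
    seen = set()  # sequences encountered earlier in the file
    for name, seq in reads:
        others = [s for n, s in reads if n != name]
        if seq in seen:
            # identical to a read that appeared earlier
            labels[name] = "exact duplicate"
        elif any(len(s) > len(seq) and s.startswith(seq) for s in others):
            # matches the start (5' end) of a longer read
            labels[name] = "5' duplicate"
        elif any(len(s) > len(seq) and s.endswith(seq) for s in others):
            # matches the end (3' end) of a longer read
            labels[name] = "3' duplicate"
        else:
            labels[name] = "unique"
        seen.add(seq)
    return labels


# The forward reads from the test file above
reads = [
    ("Seq1", "AACCGGTTAACCGGTTAACCGGTT"),
    ("Seq2", "AACCGGTTAACCGGTTAACCGGTT"),
    ("Seq3", "AACCGGTTAACCGGTTAACCGGTA"),
    ("Seq4", "AACCGGTTAACCGGTTAACCGGTC"),
    ("Seq5", "CCGGTTAACCGGTTAACCGGTT"),
    ("Seq6", "AACCGGTTAACCGGTTAACCGG"),
]

labels = classify_duplicates(reads)
for name, label in labels.items():
    print(name, label)
```

The all-against-all comparison is quadratic in the number of reads, so this is only suitable for checking small test files like the one above.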

But when you add the second paired file (all sequences identical):

@Seq1/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq2/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq3/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq4/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq5/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@Seq6/2
TTTTTTTAAAAAAATTTTTTTAAAAAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH

Run prinseq:

prinseq-lite.pl -derep 12345 -fastq test1.fastq -fastq2 test2.fastq

Prinseq now only removes the exact duplicate sequence (Seq1 and Seq2 are duplicates). Why is it not removing Seq5 and Seq6? In my mind it should be.

Even if the paired reads are concatenated together - Seq5 should still be removed as it would be a 3' duplicate.
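The concatenation argument can be checked directly on the test data. The sketch below assumes, purely for illustration, that each pair is compared as the forward read followed by the reverse read; whether prinseq does this internally is not confirmed anywhere in this thread.

```python
# Forward reads that matter for the argument, plus the shared reverse read
r1 = {
    "Seq1": "AACCGGTTAACCGGTTAACCGGTT",
    "Seq5": "CCGGTTAACCGGTTAACCGGTT",
    "Seq6": "AACCGGTTAACCGGTTAACCGG",
}
r2 = "TTTTTTTAAAAAAATTTTTTTAAAAAA"

# Hypothetical pair representation: R1 immediately followed by R2
joined = {name: seq + r2 for name, seq in r1.items()}

# Seq5 lacks only the first two bases of Seq1, so the joined pair is
# still a suffix (3' duplicate) of the joined Seq1 pair.
seq5_is_suffix = joined["Seq1"].endswith(joined["Seq5"])

# Seq6 lacks the last two bases of Seq1, so joining shifts R2 and the
# joined pair is no longer a prefix (5' duplicate) of the joined Seq1 pair.
seq6_is_prefix = joined["Seq1"].startswith(joined["Seq6"])

print(seq5_is_suffix, seq6_is_prefix)
```

Under that assumption Seq5 remains a 3' duplicate after joining while Seq6 does not remain a 5' duplicate, which matches the expectation that at least Seq5 should have been removed.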

Anyone got any ideas? Maybe in paired end mode prinseq only deals with exact duplicates - or maybe I'm doing something wrong.

next-gen sequence • 7.8k views

ADD COMMENT • link updated 8.0 years ago by shenwei356 8.7k • written 8.1 years ago by Tonor ▴ 480

2


I am not going to answer your question directly, but you may be able to use a workaround. bbduk.sh/dedupe.sh from BBMap may be an alternative to try. This will require a significant amount of memory, depending on the size of your dataset.

Otherwise you can let prinseq do its thing with R1, then use the repair.sh tool from BBMap to bring the R1/R2 files back in sync.
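Conceptually, bringing the files back in sync means keeping only the reads whose mate survived in the other file. The Python sketch below illustrates that idea on read names ending in /1 and /2, as in the test files from the question. The function name `resync_pairs` is hypothetical, and this is not how repair.sh is implemented; it only shows the principle.

```python
def resync_pairs(r1_records, r2_records):
    """Keep only reads present in both files, in R1 order.

    Each record is a (name, sequence) tuple; names end in /1 or /2.
    """
    def base(name):
        # drop the trailing /1 or /2 so mates share the same key
        return name.rsplit("/", 1)[0]

    r2_by_name = {base(name): (name, seq) for name, seq in r2_records}
    paired_r1, paired_r2 = [], []
    for name, seq in r1_records:
        mate = r2_by_name.get(base(name))
        if mate is not None:
            paired_r1.append((name, seq))
            paired_r2.append(mate)
    return paired_r1, paired_r2


# R1 after single-end deduplication (Seq2, Seq5 and Seq6 removed)
r1_dedup = [
    ("Seq1/1", "AACCGGTTAACCGGTTAACCGGTT"),
    ("Seq3/1", "AACCGGTTAACCGGTTAACCGGTA"),
    ("Seq4/1", "AACCGGTTAACCGGTTAACCGGTC"),
]
# Untouched R2 file with all six mates
r2_all = [(f"Seq{i}/2", "TTTTTTTAAAAAAATTTTTTTAAAAAA") for i in range(1, 7)]

synced_r1, synced_r2 = resync_pairs(r1_dedup, r2_all)
print([name for name, _ in synced_r2])
```

Note that this approach judges duplication on R1 alone, so a pair is dropped even when its R2 differs from the R2 of the read it duplicates.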

ADD REPLY • link 8.1 years ago by GenoMax 147k

0


Thanks for the suggestion(s) - I was going to resort to running prinseq individually on each read file and then merging, but was just wondering if there were other tools available. I shall check out BBMap.

ADD REPLY • link 8.1 years ago by Tonor ▴ 480

score 2 · Answer 1 · 2016-10-26

2


8.1 years ago

Rohit ★ 1.5k

If you want to remove the exact duplicate sequences, FastUniq might be a good choice.
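For reference, exact duplicate removal on pairs can be sketched in a few lines of Python: a pair counts as a duplicate only when both its forward and reverse sequences have been seen together before. The function name `dedupe_exact_pairs` is hypothetical and this is not FastUniq's code, just an illustration of the idea.

```python
def dedupe_exact_pairs(pairs):
    """Keep the first occurrence of each (R1, R2) sequence combination.

    pairs: list of (name, r1_sequence, r2_sequence) tuples
    """
    seen = set()
    unique = []
    for name, r1, r2 in pairs:
        key = (r1, r2)
        if key not in seen:
            seen.add(key)
            unique.append((name, r1, r2))
    return unique


# The six test pairs from the question
R2 = "TTTTTTTAAAAAAATTTTTTTAAAAAA"
pairs = [
    ("Seq1", "AACCGGTTAACCGGTTAACCGGTT", R2),
    ("Seq2", "AACCGGTTAACCGGTTAACCGGTT", R2),
    ("Seq3", "AACCGGTTAACCGGTTAACCGGTA", R2),
    ("Seq4", "AACCGGTTAACCGGTTAACCGGTC", R2),
    ("Seq5", "CCGGTTAACCGGTTAACCGGTT", R2),
    ("Seq6", "AACCGGTTAACCGGTTAACCGG", R2),
]

kept = [name for name, _, _ in dedupe_exact_pairs(pairs)]
print(kept)
```

On the test data only Seq2 is dropped, the same result the question reports for prinseq in paired-end mode.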

ADD COMMENT • link 8.1 years ago by Rohit ★ 1.5k

0


Thanks - I hadn't heard of FastUniq, but it sounds like it does the same as prinseq at the moment: it removes the pairs that are exact matches. I also want to remove the pairs that are fully contained within other pairs, but that might be too computationally intensive.

ADD REPLY • link 8.1 years ago by Tonor ▴ 480

1


I think FastUniq also removes the subset of the sequence but only matching from the 5' end, not sure if there is a better alternative for the rest of them :|

ADD REPLY • link 8.1 years ago by Rohit ★ 1.5k

score 1 · Answer 2 · 2016-11-27

1


8.0 years ago

shenwei356 8.7k

Do you want to normalize metagenomic reads?

diginorm-separation

You may try khmer from Dr. C. Titus Brown.

Related docs:

2014-khmer-protocols
What is digital normalization, anyway?, Why you shouldn't use digital normalization, more posts about digital normalization.
Protocols: Running digital normalization

ADD COMMENT • link 8.0 years ago by shenwei356 8.7k

0


Great suggestion, I shall take a look. It is related to metagenomics: I am trying to measure the diversity of taxa present within a sample based on the reads (prior to de novo assembly). If one assumes that any duplicated read is not real and is a product of some sort of PCR bias, then duplicates need to be removed.

ADD REPLY • link 8.0 years ago by Tonor ▴ 480

0


It seems similar to removing chimeric sequences in 16S sequencing data. Did you try QIIME?

ADD REPLY • link 8.0 years ago by shenwei356 8.7k

0


Swarm works for metagenomic samples, just in case :)

ADD REPLY • link 8.0 years ago by Rohit ★ 1.5k

0


Hi - Have you tried Swarm? It says amplicon based - but would it work on shotgun metagenomics?

ADD REPLY • link 8.0 years ago by Tonor ▴ 480

0


Yes, I have tried Swarm, but my purpose was completely different (I was reducing data sets of sequences rather than reads, for which vmatch is more suitable). Their testing was done on paired-end metagenomic samples too, so it must serve the purpose.

ADD REPLY • link 8.0 years ago by Rohit ★ 1.5k