Question

Getting unique reads from multiple fastq files

0

8.2 years ago

Erika Gedvilaite • 0

Hello to the Galaxy community,

I was wondering what the quickest and simplest way is to extract the reads that are unique to one of two FASTQ files.

What I have: two FASTQ files with sequences and their quality scores. What I want: one FASTQ file containing only the unique reads that appear in the first FASTQ file but not in the second.

What would be the best way to go about this? Each file contains 42M reads.

Thank you in advance for all of the help.

Erika

sequence • 4.4k views

ADD COMMENT • link 8.2 years ago by Erika Gedvilaite • 0

2

This is Biostars community :-)

Try dedupe.sh from the BBMap package.

ADD REPLY • link 8.2 years ago by GenoMax 147k

0

I second @genomax2's recommendation of dedupe.sh. I had completely forgotten that it is a sequence-based (as opposed to alignment-based) deduplicator. I would definitely try this tool first.

ADD REPLY • link 8.2 years ago by harold.smith.tarheel ★ 5.0k

0

Thanks to all!

harold.smith.tarheel, your explanation was very detailed and helpful, I'll see what I can get done with the data that I am working with.

ADD REPLY • link 8.2 years ago by Erika Gedvilaite • 0

0

Please use ADD COMMENT/ADD REPLY when responding to existing comments. This keeps threads logically organized.

ADD REPLY • link 8.2 years ago by GenoMax 147k

score 2 · Accepted Answer · 2016-09-29

FastUniq and sequniq can parse paired-end read data for duplicated sequences, while the FASTX-Toolkit collapses single-end data into unique reads. For your application, you may have to add read groups using Picard, deduplicate each file individually, merge them, deduplicate again, and split by read group. I don't believe any of those tools takes quality scores into account, and you may have to tweak the data so that only the copy from the first FASTQ file is returned (e.g., FastUniq returns the longer of two duplicates, so you could trim the reads in the second FASTQ file to ensure that outcome).
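If none of the tools fits, the set difference itself can be sketched in a few lines of Python. This is not what any of the tools named above does internally, just a minimal illustration under some assumptions: both files are uncompressed, single-end, four-line-per-record FASTQ; "duplicate" means an exact sequence match (quality scores are ignored); and reads repeated within the first file are collapsed to their first occurrence. The function names `read_fastq`, `seq_key`, and `unique_to_first` are made up for this example.

```python
import hashlib
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line-per-record FASTQ."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # end of file
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def seq_key(sequence):
    """Return a 16-byte digest of the sequence so the in-memory set stays compact."""
    return hashlib.blake2b(sequence.encode("ascii"), digest_size=16).digest()


def unique_to_first(first_handle, second_handle, out_handle):
    """Write records whose sequence occurs in the first file but not the second.

    Returns the number of records written.
    """
    # Pass 1: remember every sequence seen in the second file.
    seen = {seq_key(seq) for _, seq, _, _ in read_fastq(second_handle)}

    # Pass 2: stream the first file and keep only unseen sequences.
    written = 0
    for header, seq, plus, qual in read_fastq(first_handle):
        key = seq_key(seq)
        if key in seen:
            continue
        seen.add(key)  # assumption: also collapse duplicates within the first file
        out_handle.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
        written += 1
    return written
```

To use it, open the three files and call something like `unique_to_first(open("first.fastq"), open("second.fastq"), open("unique.fastq", "w"))` (the file names are placeholders). With 42M reads per file, the set of digests will likely need several gigabytes of RAM, so a dedicated tool is probably still the better first choice.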