Getting unique reads from multiple fastq files
8.2 years ago

Hello to the Galaxy community,

I was wondering what the quickest and simplest way is to extract unique reads from two FASTQ files.

What I have: two FASTQ files with sequences and their quality scores.
What I want: one FASTQ file that contains only the reads seen in the first FASTQ file but not in the second.

What would be the best way to go about it? Each file has 42M reads.

Thank you in advance for all of the help.

Erika

sequence • 4.5k views

This is the Biostars community :-)

Dedupe.sh from BBMap package.


I second @genomax2's recommendation for dedupe.sh. I had totally forgotten that it's a sequence-based (as opposed to alignment-based) deduplicator. I would definitely try this tool first.


Thanks to all!

harold.smith.tarheel, your explanation was very detailed and helpful, I'll see what I can get done with the data that I am working with.


Please use ADD COMMENT/ADD REPLY when responding to existing comments. This keeps threads logically organized.

8.2 years ago

FastUniq and sequniq can parse paired-end read data for duplicated sequences, while FASTX Toolkit collapses single-end data into unique reads. For your application, you may have to add read groups using Picard, deduplicate each file individually, merge, deduplicate again, and split by read group. I don't believe any of those tools takes quality scores into account, and you may have to tweak the data so that only the copy from the first FASTQ file is returned (e.g., FastUniq returns the longer of two duplicates, so you could trim the reads in the second FASTQ to ensure that outcome).
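If installing a dedicated tool is not an option, the same "in the first file but not the second" idea can be sketched in plain Python. This is only a minimal illustration, not how any of the tools above work internally. It assumes well-formed four-line FASTQ records, single-end reads, exact sequence matching (no mismatches, no reverse complements), and enough RAM to hold the sequences as a set, which for 42M reads may be several GB. The function names `read_fastq`, `open_text`, and `unique_to_first` are made up for this example.

```python
import gzip
from typing import Iterator, Tuple

Record = Tuple[str, str, str, str]


def read_fastq(handle) -> Iterator[Record]:
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def open_text(path: str):
    """Open a plain or gzip-compressed file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def unique_to_first(first_path: str, second_path: str, out_path: str) -> int:
    """Write reads whose sequence occurs in the first file but not the second.

    Only the first copy of each sequence is kept, with its original quality
    string. Returns the number of records written.
    """
    # Assumption: all sequences from the second file fit in memory.
    with open_text(second_path) as fh:
        second_seqs = {seq for _, seq, _, _ in read_fastq(fh)}

    seen = set()  # sequences already written from the first file
    written = 0
    with open_text(first_path) as fin, open(out_path, "w") as fout:
        for header, seq, plus, qual in read_fastq(fin):
            if seq in second_seqs or seq in seen:
                continue
            seen.add(seq)
            fout.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
            written += 1
    return written
```

Calling `unique_to_first("first.fastq", "second.fastq", "only_in_first.fastq")` (placeholder file names) writes the result and returns the number of reads kept. If duplicates within the first file should be retained, drop the `seen` check. For paired-end data or inexact matching, one of the tools mentioned above is probably the better route.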
