Question

Grepping unique reads from a FASTQ file

9.5 years ago

sumithrasank75 ▴ 140

I have a FASTQ file with reads, but some of them are duplicates. How can I get the unique entries while keeping the 4-line FASTQ format?

sequencing next-gen • 5.5k views

ADD COMMENT • link updated 24 months ago by Ram 44k • written 9.5 years ago by sumithrasank75 ▴ 140

You might want to clarify what you mean by "duplicate" in this case. Do you mean that they have the same sequence or that you have a single read from the machine duplicated multiple times?

BTW, the former situation is addressed by Ram's answer, the latter by Pierre's.
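One way to tell the two cases apart is to count repeated sequences and repeated read names separately. The following is a minimal sketch, assuming a plain FASTQ with strictly 4 lines per record; the function names are placeholders chosen for illustration.

```shell
# Hypothetical helpers; both assume strict 4-line FASTQ records
# (no wrapped sequence lines).

# Number of distinct sequences that occur more than once.
count_dup_seqs() {
    awk 'NR % 4 == 2' "$1" | LC_ALL=C sort | uniq -d | wc -l
}

# Number of distinct read names (first whitespace-delimited field
# of each header line) that occur more than once.
count_dup_names() {
    awk 'NR % 4 == 1 { print $1 }' "$1" | LC_ALL=C sort | uniq -d | wc -l
}

# Usage:
#   count_dup_seqs reads.fq
#   count_dup_names reads.fq
```

If only the first count is non-zero, the duplicates are distinct reads that happen to share a sequence; a non-zero second count suggests the same read record appears more than once.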

ADD REPLY • link 9.5 years ago by Devon Ryan 105k

Ram · Answer 1 · 2015-06-15

9.5 years ago

Pierre Lindenbaum 164k

gunzip -c in.fq.gz | paste - - - - | LC_ALL=C sort -t '\t' -k2,2 -u | tr "\t" "\n" | gzip > out.fq.gz

ADD COMMENT • link updated 24 months ago by Ram 44k • written 9.5 years ago by Pierre Lindenbaum 164k

When I run this I get an error:

sort: multi-character tab `\\t'

Also, I have a plain .fastq file, not a compressed fastq.gz.

ADD REPLY • link updated 24 months ago by Ram 44k • written 9.5 years ago by sumithrasank75 ▴ 140

Try

cat in.fq | paste - - - - | LC_ALL=C sort -t$'\t' -k2,2 -u | tr "\t" "\n" | gzip > out.fq.gz

Waiting for someone to write UUOC (useless use of cat) :)
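For a plain FASTQ in and a plain FASTQ out, the same idea can be sketched without cat or gzip. This assumes strict 4-line records and no tab characters inside header lines; dedupe_by_seq is a hypothetical helper name. The tab is produced with printf so the function does not depend on the $'\t' quoting, which not every shell supports.

```shell
# Hypothetical helper: keep one record per distinct sequence.
# Assumes strict 4-line FASTQ records and no tabs in header lines.
dedupe_by_seq() {
    tab=$(printf '\t')   # a literal tab character, portable across shells
    # 1. paste joins each 4-line record into one tab-separated line.
    # 2. sort -u keyed on field 2 (the sequence) keeps one line per sequence.
    # 3. tr restores the 4-line layout.
    paste - - - - < "$1" \
        | LC_ALL=C sort -t "$tab" -k2,2 -u \
        | tr '\t' '\n'
}

# Usage:
#   dedupe_by_seq in.fq > out.fq
```

Note that the output comes back ordered by sequence rather than in the original read order, and which of several identical-sequence records survives should not be relied upon.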

ADD REPLY • link 9.5 years ago by Sukhi Singh 11k

Thanks, this works

ADD REPLY • link 9.5 years ago by sumithrasank75 ▴ 140

Yeah, I wrote '\t' to show you that the delimiter is a tab; sort needs an actual tab character there.

ADD REPLY • link updated 24 months ago by Ram 44k • written 9.5 years ago by Pierre Lindenbaum 164k

Thanks for clarifying this

ADD REPLY • link updated 24 months ago by Ram 44k • written 9.5 years ago by sumithrasank75 ▴ 140

Ram · Answer 2 · 2015-06-15

9.5 years ago

Brian Bushnell 20k

With the BBMap package:

dedupe.sh in=reads.fq out=nodupes.fq

The output will contain exactly 1 copy of every unique sequence. It's extremely fast, but may take more memory than other solutions - the amount of memory is proportional to the number of unique reads (rather than, say, the total input size).
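The memory trade-off described here is typical of hash-based deduplication in general. As an illustration only (this is not BBMap's implementation, and dedupe_first_seen is a hypothetical name), the sketch below keeps every distinct sequence in an awk array, so memory grows with the number of unique sequences. It assumes strict 4-line FASTQ records and no tab characters in header lines.

```shell
# Hypothetical helper, not part of BBMap: keep the first record seen
# for each distinct sequence, preserving the input order.
dedupe_first_seen() {
    # seen[] holds one entry per unique sequence, so memory use is
    # proportional to the number of unique sequences, not the input size.
    paste - - - - < "$1" \
        | awk -F '\t' '!seen[$2]++' \
        | tr '\t' '\n'
}

# Usage:
#   dedupe_first_seen reads.fq > nodupes.fq
```

Unlike a sort-based pipeline, this needs no sorting pass and leaves the surviving reads in their original order, at the cost of holding all unique sequences in memory.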

ADD COMMENT • link updated 24 months ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k

score 0 · Answer 3 · 2015-06-15

9.5 years ago

Ram 44k

PRINSEQ should solve your problem. Check it out here: http://prinseq.sourceforge.net/manual.html#QCDUPLICATION

A bit of digging should get you the command line options for the feature.

ADD COMMENT • link 24 months ago by Ram 44k