I have a fastq file with reads, but there are duplicates. Can you tell me how I can get the unique entries in the 4-row fastq format?
gunzip -c in.fq.gz | paste - - - - | LC_ALL=C sort -t $'\t' -k2,2 -u | tr "\t" "\n" | gzip > out.fq.gz
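To see what each stage of that pipeline does, here is a small sketch on a toy file. The file names and the three records are made up for illustration. Note that sort -t needs a real tab character as its separator, so the sketch builds one with printf to stay portable across shells.

```shell
# Hypothetical toy input: three records, two of which share the sequence ACGT.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nHHHH\n@r3\nTTGA\n+\nIIII\n' > toy.fq

# A literal tab character, built portably (works in plain sh as well as bash).
TAB=$(printf '\t')

# Step 1: paste joins every 4 lines into one tab-separated line per record.
# Step 2: sort -u on field 2 (the sequence) keeps one record per distinct sequence.
# Step 3: tr turns the tabs back into newlines, restoring the 4-line layout.
paste - - - - < toy.fq | LC_ALL=C sort -t "$TAB" -k2,2 -u | tr '\t' '\n' > toy.dedup.fq

cat toy.dedup.fq
```

The result has two records, one per distinct sequence, ordered by sequence rather than by position in the input. Which of the duplicate records survives (and therefore which header and quality string you keep) is decided by sort, so do not rely on it being the best-quality copy.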
With the BBMap package:
dedupe.sh in=reads.fq out=nodupes.fq
The output will contain exactly one copy of each unique sequence. It's extremely fast, but may need more memory than other solutions: memory use is proportional to the number of unique reads rather than to the total input size.
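The same memory trade-off can be seen with a plain awk hash, which is only an illustration of the idea and not how dedupe.sh is implemented. The sketch below keeps every distinct sequence in an in-memory array (the name seen and the toy file names are arbitrary), so memory grows with the number of unique reads, and in exchange no sorting is needed.

```shell
# Hypothetical toy input: the first and third records share the sequence ACGT.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nACGT\n+\nHHHH\n' > toy2.fq

# One record per line, then print a record only the first time its
# sequence (field 2) is seen; seen[] holds one entry per unique sequence.
paste - - - - < toy2.fq | awk -F '\t' '!seen[$2]++' | tr '\t' '\n' > toy2.dedup.fq

cat toy2.dedup.fq
```

Unlike the sort-based approach, this keeps the first occurrence of each sequence and preserves the input order, so here @r1 and @r2 are written and @r3 is dropped.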
PRINSEQ should solve your problem. Check it out here: http://prinseq.sourceforge.net/manual.html#QCDUPLICATION
A bit of digging should get you the command line options for the feature.
You might want to clarify what you mean by "duplicate" in this case. Do you mean that they have the same sequence or that you have a single read from the machine duplicated multiple times?
BTW, the former situation is addressed by RAM's answer, the latter by Pierre's.
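If you are not sure which kind of duplicate you have, a quick check along these lines may help. It is a rough sketch on a made-up toy file: it counts how many headers occur more than once (the same read written out repeatedly) and how many sequences occur more than once (different reads with identical bases).

```shell
# Hypothetical toy input: @r1 appears twice verbatim; @r2 has the same bases
# as @r1 but is a different read; @r3 and @r4 also share a sequence.
printf '@r1\nACGT\n+\nIIII\n@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nHHHH\n' > toy3.fq
printf '@r3\nTTGA\n+\nIIII\n@r4\nTTGA\n+\nIIII\n' >> toy3.fq

# Field 1 is the header line, field 2 the sequence; uniq -d prints each
# value that occurs more than once exactly one time.
dup_headers=$(paste - - - - < toy3.fq | cut -f1 | LC_ALL=C sort | uniq -d | wc -l)
dup_seqs=$(paste - - - - < toy3.fq | cut -f2 | LC_ALL=C sort | uniq -d | wc -l)

echo "headers seen more than once:   $dup_headers"
echo "sequences seen more than once: $dup_seqs"
```

In this toy file one header is repeated (@r1) while two sequences are repeated (ACGT and TTGA), which is the difference between the two situations described above.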