Removing duplicate sequences while extracting the reads from fastq.gz
5.2 years ago

I can extract the reads from a fastq.gz file as follows.

 gunzip -c in.fastq.gz | awk '(NR%4==2)' > out.seq

Is there any way to extract only the unique reads and discard any duplicate reads?

genome sequencing sequence • 2.5k views

How do you define a duplicate read? Same sequence? Same identifier? Same sequence and quality? All of those? How did you end up with duplicate reads?


I'm not sure I understand your question, but you can use Picard's MarkDuplicates (check the manual) to remove duplicated reads!

5.2 years ago
gb ★ 2.2k
  1. quality trim out.seq
  2. Check length distribution
  3. trim all reads to the same length
  4. use vsearch --derep_fulllength
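
A rough sketch of steps 2 to 4 using only standard tools, assuming a file with one sequence per line like the out.seq produced in the question. The toy input toy.seq, the trim length of 8, and the output names trimmed.seq and derep.txt are made up for illustration. The final sort | uniq -c is a stand-in for exact full-length dereplication, not vsearch itself. Step 1 is not shown because quality trimming needs the quality lines from the original FASTQ.

```shell
# Toy stand-in for out.seq: one sequence per line (hypothetical data)
printf 'ACGTACGTAA\nACGTACGT\nACGTACGTCC\nTTTTGGGG\nACG\n' > toy.seq

# Step 2: length distribution (number of reads at each length)
awk '{ print length($0) }' toy.seq | sort -n | uniq -c

# Step 3: trim every read to a fixed length and drop reads shorter than that
# (len=8 is an arbitrary choice for this toy data)
awk -v len=8 'length($0) >= len { print substr($0, 1, len) }' toy.seq > trimmed.seq

# Step 4 stand-in: collapse identical sequences, most abundant first
sort trimmed.seq | uniq -c | sort -k1,1nr > derep.txt
```

Each line of derep.txt is a count followed by a sequence, so reads that only differed beyond the trim length collapse into one entry.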
5.2 years ago

Are you sure you want to do this at the fastq level? (I don't understand why you want to do this at all) Do you really want to count every sequence with a one-off error as a unique sequence?

The typical thing to do would be to align your reads to their reference, then use Picard's MarkDuplicates.

But if you really want to get unique sequences in the raw fastq:

zcat my.fastq.gz | awk 'NR%4==2' | awk '!x[$0]++' > unique.txt
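
To spell out how that works: in awk, x[$0]++ evaluates to 0 the first time a line is seen, so !x[$0]++ is true only for first occurrences. Below is a minimal sketch on a made-up three-read file (toy.fastq.gz and the output names are hypothetical). The second pipeline is a variant that keeps the whole FASTQ record of the first read carrying each sequence; it assumes strict 4-line records and no tab characters in the headers. gunzip -c is used instead of zcat because zcat behaves differently on some systems.

```shell
# Tiny gzipped FASTQ: read1 and read3 share the same sequence
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n@read3\nACGT\n+\nFFFF\n' | gzip > toy.fastq.gz

# Sequences only: keep line 2 of each record, and only the first time it appears
gunzip -c toy.fastq.gz | awk 'NR % 4 == 2 && !seen[$0]++' > unique.txt

# Full records: paste joins each 4-line record into one tab-separated line,
# awk keeps the first record per sequence (field 2), tr restores the 4 lines
gunzip -c toy.fastq.gz \
  | paste - - - - \
  | awk -F '\t' '!seen[$2]++' \
  | tr '\t' '\n' \
  | gzip > unique.fastq.gz
```

On this toy input, unique.txt holds ACGT and GGCC, and unique.fastq.gz keeps the records for read1 and read2. Input order is preserved in both cases.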

A reference may not always be available.

Would that awk solution scale well if one has millions of reads? This is where clumpify comes in handy.


I haven't tested it. Its virtue is that you don't have to install any software. It might eat up a lot of memory: since it's not sorting, I guess it has to remember every distinct sequence it has seen.
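
If memory does become the problem, one possible workaround with standard tools is to sort instead of hashing. Implementations such as GNU sort spill to temporary files on disk rather than holding everything in RAM, at the cost of losing the original read order. This is an untested-at-scale sketch on made-up data; toy2.fastq.gz and the output file names are hypothetical.

```shell
# Tiny gzipped FASTQ: r1 and r3 share the same sequence
printf '@r1\nTTGA\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTGA\n+\nIIII\n' | gzip > toy2.fastq.gz

# Unique sequences via sort -u; LC_ALL=C gives plain byte-order comparison
gunzip -c toy2.fastq.gz | awk 'NR % 4 == 2' | LC_ALL=C sort -u > unique_sorted.txt

# Same idea, but report how many times each sequence occurs, most frequent first
gunzip -c toy2.fastq.gz | awk 'NR % 4 == 2' | LC_ALL=C sort | uniq -c \
  | sort -k1,1nr > counts.txt
```

The output comes back in sorted order rather than input order, which is the trade-off against the awk one-liner.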

5.2 years ago
GenoMax 147k

Use clumpify.sh from the BBMap suite. You can use the fastq data as is; I suggest you do no other manipulations. See: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

You can choose to allow one or more errors when matching duplicates, and you can separate PCR duplicates from optical duplicates.
