Counting number of duplicated reads from fastq file
9.9 years ago
mike ▴ 90

Hello,

Which tool or method is available for counting or extracting duplicated reads from paired-end FASTQ files? The tools I have checked can only remove duplicated reads.

Thanks

NGS • 8.4k views
9.9 years ago

Hi Mike,

If you want to deduplicate raw paired fastq files, I recommend trying dedupe.sh from the BBMap package. You can run it like this:

dedupe.sh in1=read1.fq in2=read2.fq out1=x1.fq out2=x2.fq ac=f

That will also print the exact number of duplicates removed.
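If only a count is needed and nothing should be removed, exact duplicate pairs can also be tallied with standard Unix tools. The following is a minimal sketch, not a replacement for dedupe.sh: the function name `count_duplicate_pairs` is hypothetical, and it assumes uncompressed FASTQ with exactly four lines per record and R1/R2 in the same order. It only catches pairs whose sequences are identical in both mates (no reverse complements, no mismatches).

```shell
# Count exact duplicate read pairs without removing anything.
# Assumption: uncompressed FASTQ, 4 lines per record, R1 and R2 in the same order.
count_duplicate_pairs() {
    # paste joins line i of R1 with line i of R2 (tab-separated);
    # line 2 of every 4-line record then holds both mate sequences.
    paste "$1" "$2" |
        awk 'NR % 4 == 2 { seen[$0]++ }
             END {
                 dup = 0
                 for (k in seen) dup += seen[k] - 1   # copies beyond the first
                 print dup
             }'
}
```

Usage would look like `count_duplicate_pairs read1.fq read2.fq`; the printed number is the count of pairs beyond the first copy of each distinct pair.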

9.9 years ago
iraun 6.2k

You can remove duplicates (using picard, samtools or whatever) and then count how many reads are missing from the de-dupped file, no?
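For FASTQ input, the "count what is missing" idea can be sketched as below. The helper names `count_fastq_reads` and `count_removed_reads` are made up for illustration, and the sketch assumes uncompressed FASTQ files with exactly four lines per record.

```shell
# Number of reads in an uncompressed FASTQ file (4 lines per record assumed).
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Reads lost to deduplication = reads before minus reads after.
count_removed_reads() {
    before=$(count_fastq_reads "$1")
    after=$(count_fastq_reads "$2")
    echo $((before - after))
}
```

For example, `count_removed_reads read1.fq x1.fq` would report how many reads are absent from the deduplicated file, whichever tool produced it.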


If you don't want to remove duplicates, you can also count the reads flagged as duplicates (SAM flag 1024) with samtools:

samtools view -c -f 1024 file.bam

These tools are for aligned bam files.


Oh, sorry, my mistake. In that case, I would suggest mapping first; in my opinion, it is much better to map first and then remove duplicates. But if you want to remove duplicates first, you could try fastx_collapser: collapse the duplicates and count how many reads you have lost.


It should be noted that fastx_collapser only works on single-end reads (this is pretty common for tools like this).


Okay thanks I will give it a shot.


The FASTX-Toolkit is not designed for paired-end reads, if I am not mistaken.


Unless you want to use these for assembly, it's generally fast enough to just align and remove/mark duplicates from the resulting BAM file.

8.0 years ago
kaixian110 • 0

Hi, have you found a solution for extracting duplicated reads from paired FASTQ files?


BBMap's dedupe program has an "outd" flag that will capture duplicate reads:

dedupe.sh in1=read1.fq in2=read2.fq out1=x1.fq out2=x2.fq ac=f outd=dupes.fq

Alternatively, you can use Clumpify:

clumpify.sh in=reads.fq out=clumped.fq markduplicates allduplicates

This command assumes paired reads are interleaved in a single file, although the upcoming release supports paired reads in twin files. The "allduplicates" flag will mark all copies as duplicates; if you remove that, all but one copy will be marked as duplicates (which is probably better for most purposes). The "optical" flag will mark only optical duplicates (rather than, say, PCR duplicates). Anyway, "clumped.fq" will contain all of the reads, but the duplicates will be marked with " duplicate". So you can then separate them like this:

filterbyname.sh in=clumped.fq out=dupes.fq include=t names=duplicate substring
filterbyname.sh in=clumped.fq out=unique.fq include=f names=duplicate substring
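If only the number of marked reads is wanted, the headers in "clumped.fq" can be counted directly instead of being split into files. This is a small sketch with a hypothetical function name, `count_marked_duplicates`; it assumes uncompressed, four-line FASTQ records and relies on the " duplicate" mark in the header described above. For interleaved pairs it counts reads, not pairs.

```shell
# Count reads whose header line carries the " duplicate" mark.
# Assumption: uncompressed FASTQ with 4 lines per record, so headers are lines 1, 5, 9, ...
count_marked_duplicates() {
    awk 'NR % 4 == 1 && / duplicate/ { n++ } END { print n + 0 }' "$1"
}
```

It could be called as `count_marked_duplicates clumped.fq`.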

One can easily get interleaved data files for clumpify.sh by using another tool from BBMap: reformat.sh in1=R1.fq.gz in2=R2.fq.gz out=int.fq.gz.
