Hello,
Which tool or method is available for counting or extracting the number of duplicated reads from fastq files with paired reads? I have checked various tools, but they can only remove duplicated reads.
Thanks
Hi Mike,
If you want to deduplicate raw paired fastq files, I recommend trying dedupe.sh from the BBMap package. You can run it like this:
dedupe.sh in1=read1.fq in2=read2.fq out1=x1.fq out2=x2.fq ac=f
That will also print the exact number of duplicates removed.
You can remove duplicates (using Picard, samtools, or whatever) and then count how many reads are missing from the de-dupped file, no?
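For example, a minimal sketch of that counting step in the shell (filenames match the dedupe.sh example above; assumes uncompressed fastq, where each record is four lines):

before=$(( $(wc -l < read1.fq) / 4 ))
after=$(( $(wc -l < x1.fq) / 4 ))
echo "$(( before - after )) duplicate pairs removed"

Since this counts one file of the pair, on paired data it gives the number of duplicate pairs rather than individual reads.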
Hi, have you found any solution to extract duplicated reads from paired fastq files?
BBMap's dedupe program has an "outd" flag that will capture duplicate reads:
dedupe.sh in1=read1.fq in2=read2.fq out1=x1.fq out2=x2.fq ac=f outd=dupes.fq
Alternatively, you can use Clumpify:
clumpify.sh in=reads.fq out=clumped.fq markduplicates allduplicates
This command assumes paired reads are interleaved in a single file, although the upcoming release supports paired reads in twin files. The "allduplicates" flag will mark all copies as duplicates; if you remove it, all but one copy will be marked as duplicates (which is probably better for most purposes). The "optical" flag will mark only optical duplicates (rather than, say, PCR duplicates).

Anyway, "clumped.fq" will contain all of the reads, but the duplicates will be marked with " duplicate" in their names. So you can then separate them like this:
filterbyname.sh in=clumped.fq out=dupes.fq include=t names=duplicate substring
filterbyname.sh in=clumped.fq out=unique.fq include=f names=duplicate substring
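If you just want the count rather than separate files, here is a quick sketch (assuming the " duplicate" tag Clumpify appends to read headers, and uncompressed output):

awk 'NR % 4 == 1 && / duplicate/' clumped.fq | wc -l

This checks only the header line of each four-line fastq record, so sequence or quality lines that happen to contain the word are not miscounted.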
If you don't want to remove duplicates, you can also use samtools flags.
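For instance, once duplicates have been marked in a BAM (by Picard MarkDuplicates, samtools markdup, or similar), you can count the reads carrying the duplicate flag (0x400) like this; "marked.bam" is a placeholder name:

samtools view -c -f 0x400 marked.bam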
These tools are for aligned bam files.
Oh, sorry, my mistake. I would suggest you map first; in my opinion, it is much better to map first and remove duplicates afterwards. But if you want to remove duplicates first, you could try fastx_collapser, remove the duplicates, and count how many you have lost.
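For the map-first route, a minimal sketch with Picard (file names are placeholders); the metrics file it writes reports the duplicate counts:

java -jar picard.jar MarkDuplicates I=aligned.bam O=marked.bam M=dup_metrics.txt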
It should be noted that fastx_collapser only works on single-end reads (this is pretty common for tools like this).
Okay thanks I will give it a shot.
The FASTX-Toolkit is not designed for paired-end reads, if I am not wrong.
Unless you want to use these for assembly, it's generally fast enough to just align and remove/mark duplicates from the resulting BAM file.
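As a quick sanity check after marking, samtools flagstat will report the number of reads flagged as duplicates (again, "marked.bam" is a placeholder for a BAM in which duplicates have already been marked):

samtools flagstat marked.bam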