I have tried samtools rmdup on my paired-end fastq files, which had been trimmed beforehand. According to the samtools manual, rmdup works as follows: "Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality."
I have 23% duplicates in my data (found by aligning raw reads to the reference). Trimming the raw reads would have trimmed duplicates into reads of different lengths. How then would rmdup work on my pre-processed reads?
Is there a better option?
I think you are mixing up two things.
1) Trimming is done on the fastq files, but rmdup works on aligned data.
2) Duplicates are defined by the 5' ends of the paired-end reads, while trimming takes bases away from the 3' ends. So no worries, trimming will not affect duplicate removal.
"external coordinates" means both 5' as well as 3', doesn't it?
No. In paired-end data the fragment (and hence the insert size) is defined solely by the 5' ends of the forward and reverse reads. Have a look at this figure: no matter where the 3' ends (the arrow heads) are, the 5' ends are unaffected, and so are the insert size and therefore the definition of a duplicate.
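To make the point concrete, here is a toy sketch in Python. It is not how samtools implements rmdup; it only assumes, as described above, that a pair is identified by its chromosome plus the two 5' positions. The names Pair, trim_3p and duplicate_key are made up for this illustration, and things like soft clipping and strand are ignored.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    """Toy model of an aligned read pair (hypothetical, for illustration only)."""
    chrom: str
    fwd_5p: int   # leftmost aligned base of the forward read (its 5' end)
    fwd_len: int  # aligned length of the forward read
    rev_5p: int   # rightmost aligned base of the reverse read (its 5' end)
    rev_len: int  # aligned length of the reverse read


def trim_3p(pair: Pair, n_fwd: int, n_rev: int) -> Pair:
    """Remove bases from the 3' ends only; the 5' positions stay where they are."""
    return pair._replace(fwd_len=pair.fwd_len - n_fwd,
                         rev_len=pair.rev_len - n_rev)


def fwd_3p(pair: Pair) -> int:
    """3' end of the forward read (moves left when the read is trimmed)."""
    return pair.fwd_5p + pair.fwd_len - 1


def rev_3p(pair: Pair) -> int:
    """3' end of the reverse read (moves right when the read is trimmed)."""
    return pair.rev_5p - pair.rev_len + 1


def duplicate_key(pair: Pair) -> tuple:
    """Assumed duplicate definition: chromosome plus both 5' coordinates."""
    return (pair.chrom, pair.fwd_5p, pair.rev_5p)


# Two copies of the same fragment, trimmed by different amounts.
original = Pair("chr1", 1000, 100, 1299, 100)
trimmed_a = trim_3p(original, 5, 0)
trimmed_b = trim_3p(original, 20, 12)

print("3' ends differ:", fwd_3p(trimmed_a) != fwd_3p(trimmed_b))
print("same duplicate key:", duplicate_key(trimmed_a) == duplicate_key(trimmed_b))
```

The two trimmed copies end at different 3' positions but still share the same key, which is why they would still be recognised as duplicates of each other under this definition.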
Thanks! Got it! Why do I still have 14% duplicates?
Also, I had 23% duplicates earlier. After using rmdup, I'm still left with 14% duplicates.
samtools rmdup does not remove duplicates when the two reads of a pair map to different chromosomes. Do the remaining 14% of duplicates map to different chromosomes?
How can I find this out?
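One possible way, as a rough sketch: look at the SAM records of your alignment and compare the RNAME column (column 3) with the RNEXT column (column 7). In the SAM format, RNEXT is "=" when the mate maps to the same reference sequence. The Python below is only an illustration; classify and summarize are made-up names, and the three example records are invented. In practice the lines would come from the text output of `samtools view` on your BAM file.

```python
from collections import Counter

# SAM FLAG bits used here
PAIRED = 0x1         # read is paired in sequencing
UNMAPPED = 0x4       # read itself is unmapped
MATE_UNMAPPED = 0x8  # mate is unmapped


def classify(sam_line: str) -> str:
    """Classify one SAM alignment record by where its mate maps."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname, rnext = fields[2], fields[6]
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED or flag & MATE_UNMAPPED:
        return "unmapped_end"
    if rnext == "=" or rnext == rname:
        return "same_chromosome"
    return "different_chromosome"


def summarize(lines) -> Counter:
    """Count records per class, skipping header lines and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        counts[classify(line)] += 1
    return counts


# Invented example records (11 mandatory SAM columns each).
EXAMPLE = [
    "@SQ\tSN:chr1\tLN:1000000",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r2\t65\tchr1\t500\t60\t50M\tchr2\t900\t0\t*\t*",
    "r3\t73\tchr1\t700\t60\t50M\t=\t700\t0\t*\t*",
]

for category, n in sorted(summarize(EXAMPLE).items()):
    print(category, n, sep="\t")
```

Note that this counts records (reads), not pairs, since each pair normally appears as two records. If the number in the different_chromosome class is roughly in line with your remaining duplicates, that would support the explanation above.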
Hello,
how did you trim your reads? When trimming, it can happen that not all reads survive because they end up too short.
fin swimmer
I used bbduk from bbtools to trim my reads. Yes, I lost some reads in the process of doing so, but not many.