I have some paired-end FASTQ files that supposedly come directly from Illumina, but they contain a number of records with duplicate names (and different sequences), which makes MergeBamAlignment fail. So I need a tool to remove all such duplicates. I saw the advice to use seqtk (Duplicate/identical reads in fastq file), but seqtk keeps one copy of each duplicate, which may lead to wrong results because there is no guarantee that it keeps the two reads from the same pair.
Is there a tool that removes all reads that have duplicate names?
But they contain some number of records with duplicate names (but different sequences)
With normal Illumina sequence data that should not happen. If at all possible, I advise that you go back and find the original data. This indicates that someone has fiddled with this file in some way, and you have no way of knowing what else may have happened.
That said, you may be able to use dedupe.sh from the BBMap suite. Take a look at the in-line help, especially the rmn= parameter.
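I have not tested this on your data, but a minimal invocation would look something like the line below (file names are placeholders; check dedupe.sh's in-line help for the exact meaning of rmn= and the other parameters before relying on it):

dedupe.sh in=reads.fastq out=deduped.fastq rmn=t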
Thanks! However, it seemed too complicated to me, and I could not make it do what I wanted, so I wrote the script myself:
#!/usr/bin/python3
import sys

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("""rmdup.py - removes all occurrences of entries with duplicate names from a fastq file
usage: rmdup.py file.fastq > file_rmdup.fastq""")
        exit()
    # Read the whole file; each FASTQ record occupies four lines.
    l = open(sys.argv[1]).readlines()
    d = {}   # every read name seen so far
    dd = {}  # names seen more than once
    # First pass: collect the names that occur more than once.
    for i in range(0, len(l), 4):
        s = l[i].split()[0]
        if s in d:
            dd[s] = 1
        d[s] = 1
    # Second pass: print only the records whose name is not duplicated.
    for i in range(0, len(l), 4):
        s = l[i].split()[0]
        if s not in dd:
            for a in range(4):
                print(l[i + a], end="")