I have some paired-end FASTQ files that supposedly come directly from Illumina, but they contain a number of records with duplicate names (and different sequences), which makes MergeBamAlignment fail. So I need a tool to remove all such duplicates. I saw the advice to use seqtk (Duplicate/identical reads in fastq file), but seqtk keeps one copy of each duplicate, which may lead to wrong results because there is no guarantee that it keeps the two reads from the same pair.
Is there a tool that removes all reads that have duplicate names?
But they contain some number of records with duplicate names (but different sequences)
With normal Illumina sequence data that should not happen. If at all possible, I advise that you go back and find the original data. This indicates that someone has fiddled with this file in some way, and you have no way of knowing what else may have happened.
That said, you may be able to use dedupe.sh from the BBMap suite. Take a look at the in-line help, especially the rmn= parameter.
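I have not tested this on your data, but a minimal invocation would look something like the line below (file names are placeholders; check dedupe.sh's in-line help for the exact meaning of rmn= and the other parameters before relying on it):

dedupe.sh in=reads.fastq out=deduped.fastq rmn=t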
Thanks! However, it seemed too complicated to me, and I could not make it do what I wanted, so I wrote the script myself:
#!/usr/bin/python3
import sys

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("""rmdup.py - removes all occurrences of entries with duplicate names from a fastq file
usage: rmdup.py file.fastq > file_rmdup.fastq""")
        exit()
    # Read the whole file; each FASTQ record occupies four lines.
    l = open(sys.argv[1]).readlines()
    d = {}   # every read name seen so far
    dd = {}  # names seen more than once
    # First pass: collect the names that occur more than once.
    for i in range(0, len(l), 4):
        s = l[i].split()[0]
        if s in d:
            dd[s] = 1
        d[s] = 1
    # Second pass: print only the records whose name is not duplicated.
    for i in range(0, len(l), 4):
        s = l[i].split()[0]
        if s not in dd:
            for a in range(4):
                print(l[i + a], end="")