Remove all entries with duplicate names from fastq file?
3.5 years ago
wormball ▴ 10

Hello!

I have some paired-end fastq files, supposedly originating directly from Illumina. However, they contain some records with duplicate names (but different sequences), which MergeBamAlignment complains about. So I need a tool to remove all such duplicates. I saw the advice to use seqtk ( Duplicate/identical reads in fastq file ), but seqtk leaves one copy of each duplicate untouched. That may lead to wrong results, because there is no guarantee that the surviving copies in the two files belong to the same pair.

Is there a tool that removes all reads that have duplicate names?

Thanks in advance.

fastq illumina duplicate

But they contain some number of records with duplicate names (but different sequences)

With normal Illumina sequence data that should not happen. If at all possible, I advise you to go back and find the original data. This indicates that someone has fiddled with the file in some way, and you have no way of knowing what else may have happened.

That said, you may be able to use dedupe.sh from the BBMap suite. Take a look at the in-line help, especially the rmn= parameter.

3.5 years ago
wormball ▴ 10

Thanks! However, it seems too complicated to me, and I could not make it do what I want, so I wrote the desired script myself:

#!/usr/bin/python3

import sys

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("""rmdup.py - removes all occurrences of entries with duplicate names from a fastq file
usage: rmdup.py file.fastq > file_rmdup.fastq""")
        sys.exit()
    with open(sys.argv[1]) as f:
        lines = f.readlines()
    seen = set()
    duplicates = set()
    # first pass: collect read names (first whitespace-separated field of the
    # header line) that occur more than once
    for i in range(0, len(lines), 4):
        name = lines[i].split()[0]
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    # second pass: print only the 4-line records whose name is unique
    for i in range(0, len(lines), 4):
        name = lines[i].split()[0]
        if name not in duplicates:
            for a in range(4):
                print(lines[i + a], end="")
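The same two-pass idea (count names, then keep only records whose name occurs exactly once) can also be packaged as a function around collections.Counter, which makes the filtering rule explicit and easy to test. This is a sketch, not part of the original post; the function name drop_duplicate_names is my own:

```python
#!/usr/bin/env python3
# Sketch of the two-pass duplicate-name filter as a reusable function.
import sys
from collections import Counter

def drop_duplicate_names(lines):
    """Given a fastq file as a list of lines (4 lines per record), return
    only the records whose read name - the first whitespace-separated
    field of the header line - occurs exactly once."""
    # pass 1: count each read name
    names = Counter(lines[i].split()[0] for i in range(0, len(lines), 4))
    # pass 2: keep only records with a unique name
    out = []
    for i in range(0, len(lines), 4):
        if names[lines[i].split()[0]] == 1:
            out.extend(lines[i:i + 4])
    return out

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        sys.stdout.writelines(drop_duplicate_names(f.readlines()))
```

As in the original script, only the first field of the header is compared, so reads whose names differ only in the comment part still count as duplicates.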

Since you have a solution that works I moved your comment to an answer. You can go ahead and accept this answer to provide closure to this thread.

