I have a script that compares one column (alignment length) of the data lines in multiple (154) BLAST result text files against a reference file (sequence length), and removes duplicate lines within each file (leaving one copy behind). However, I want to refine that second step so that it removes ALL copies of any line that occurs two or more times in a file, leaving behind only the lines that had a single copy to begin with. What is the best technique to do this?
import csv

# Build a dict mapping sequence ID -> sequence length from the reference file.
seqlengthfile = '/my/directory/seqlength2.txt'
with open(seqlengthfile) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=' ')
    target_dict = dict()
    for row in csv_reader:
        target_dict[row[0]] = float(row[1])
print(target_dict)

to_process_file = "/process/directory/pident90"
size = 154
output_folder = "/output/directory/align95"

# For each BLAST file, keep rows whose alignment length (column 4) is at least
# 95% of the reference sequence length, skipping lines already written (dedupe).
for i in range(1, size + 1):
    process_filename = to_process_file + "/BLAST_" + str(i) + ".txt"
    output_filename = output_folder + "/BLAST_" + str(i) + ".txt"
    seen = set()
    with open(output_filename, "w") as output_file:
        with open(process_filename) as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=' ')
            for row in csv_reader:
                to_write = " ".join(row) + "\n"
                if float(row[3]) >= 0.95 * target_dict[row[1]] and to_write not in seen:  # how to refine this?
                    output_file.write(to_write)
                    seen.add(to_write)
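For reference, here is a rough two-pass sketch of the refinement I have in mind (untested): count every qualifying line first with collections.Counter, then re-read the file and write out only the lines whose count is exactly 1. It reuses target_dict, size, to_process_file and output_folder from the script above.

from collections import Counter

for i in range(1, size + 1):
    process_filename = to_process_file + "/BLAST_" + str(i) + ".txt"
    output_filename = output_folder + "/BLAST_" + str(i) + ".txt"

    # First pass: count how many times each qualifying line occurs in this file.
    counts = Counter()
    with open(process_filename) as csv_file:
        for row in csv.reader(csv_file, delimiter=' '):
            if float(row[3]) >= 0.95 * target_dict[row[1]]:
                counts[" ".join(row) + "\n"] += 1

    # Second pass: write only the lines that occurred exactly once.
    with open(output_filename, "w") as output_file, open(process_filename) as csv_file:
        for row in csv.reader(csv_file, delimiter=' '):
            to_write = " ".join(row) + "\n"
            if float(row[3]) >= 0.95 * target_dict[row[1]] and counts[to_write] == 1:
                output_file.write(to_write)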
Thanks! Just to be clear, does this mean I will need to remove my previous filtering mechanism (the set 'seen')? Also, how do I run this over a large number of files at once? uniq -u only seems to work on individual files, and running that many separate commands wouldn't be ideal.
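In case it matters, this is roughly how I imagined looping over all the files from within Python rather than calling uniq -u once per file (just a sketch; it assumes the input files all match the BLAST_*.txt naming above and reuses to_process_file and output_folder):

import glob
import os

# Sketch: pick up every BLAST_*.txt in the input directory with glob instead of
# hard-coding size = 154, then apply the same per-file filter to each one.
for process_filename in sorted(glob.glob(os.path.join(to_process_file, "BLAST_*.txt"))):
    output_filename = os.path.join(output_folder, os.path.basename(process_filename))
    # ...run the two-pass filter from the sketch above on process_filename,
    # writing the surviving single-copy lines to output_filename...

Would that be a reasonable way to handle all 154 files in one go?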