I have a script that compares one column (alignment length) of the data lines in multiple (154) BLAST result text files against a reference file (sequence length), and removes duplicate lines within each file (leaving one copy behind). However, I want to refine that second step so that it removes ALL copies of any line that occurs two or more times in a file, leaving behind only the lines that had a single copy to begin with. What is the best technique to do this?
import csv

# Build a dict mapping sequence ID -> sequence length from the reference file.
seqlengthfile = '/my/directory/seqlength2.txt'
with open(seqlengthfile) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=' ')
    target_dict = dict()
    for row in csv_reader:
        target_dict[row[0]] = float(row[1])
print(target_dict)

to_process_file = "/process/directory/pident90"
size = 154
output_folder = "/output/directory/align95"

# For each BLAST file, keep rows whose alignment length (column 4) is at least
# 95% of the reference sequence length, skipping lines already written (dedupe).
for i in range(1, size + 1):
    process_filename = to_process_file + "/BLAST_" + str(i) + ".txt"
    output_filename = output_folder + "/BLAST_" + str(i) + ".txt"
    seen = set()
    with open(output_filename, "w") as output_file:
        with open(process_filename) as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=' ')
            for row in csv_reader:
                to_write = " ".join(row) + "\n"
                if float(row[3]) >= 0.95 * target_dict[row[1]] and to_write not in seen:  # how to refine this?
                    output_file.write(to_write)
                    seen.add(to_write)
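For reference, here is a rough two-pass sketch of the refinement I have in mind (untested): count every qualifying line first with collections.Counter, then re-read the file and write out only the lines whose count is exactly 1. It reuses target_dict, size, to_process_file and output_folder from the script above.

from collections import Counter

for i in range(1, size + 1):
    process_filename = to_process_file + "/BLAST_" + str(i) + ".txt"
    output_filename = output_folder + "/BLAST_" + str(i) + ".txt"

    # First pass: count how many times each qualifying line occurs in this file.
    counts = Counter()
    with open(process_filename) as csv_file:
        for row in csv.reader(csv_file, delimiter=' '):
            if float(row[3]) >= 0.95 * target_dict[row[1]]:
                counts[" ".join(row) + "\n"] += 1

    # Second pass: write only the lines that occurred exactly once.
    with open(output_filename, "w") as output_file, open(process_filename) as csv_file:
        for row in csv.reader(csv_file, delimiter=' '):
            to_write = " ".join(row) + "\n"
            if float(row[3]) >= 0.95 * target_dict[row[1]] and counts[to_write] == 1:
                output_file.write(to_write)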
Thanks! Just to be clear, does this mean I will need to remove my previous filtering mechanism (the set 'seen')? Also, how do I run this over a large number of files at once? uniq -u only seems to work on individual files, and running that many separate commands wouldn't be ideal.
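In case it matters, this is roughly how I imagined looping over all the files from within Python rather than calling uniq -u once per file (just a sketch; it assumes the input files all match the BLAST_*.txt naming above and reuses to_process_file and output_folder):

import glob
import os

# Sketch: pick up every BLAST_*.txt in the input directory with glob instead of
# hard-coding size = 154, then apply the same per-file filter to each one.
for process_filename in sorted(glob.glob(os.path.join(to_process_file, "BLAST_*.txt"))):
    output_filename = os.path.join(output_folder, os.path.basename(process_filename))
    # ...run the two-pass filter from the sketch above on process_filename,
    # writing the surviving single-copy lines to output_filename...

Would that be a reasonable way to handle all 154 files in one go?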