How to get rid of all copies of repeated lines in BLAST result text files
Hau ▴ 10 · 20 months ago

I have a script that compares one column (alignment length) of the data lines in multiple (154) BLAST result text files against a reference file (sequence length), and removes duplicate lines in each file (leaving one copy behind). However, I want to refine the latter step so that it removes ALL copies of any line with >=2 copies within a file, leaving behind only the lines that had a single copy to begin with. What is the best technique to do this?

import csv

# Build a lookup of sequence lengths: {sequence ID: length}
seqlengthfile = '/my/directory/seqlength2.txt'

with open(seqlengthfile) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=' ')
    target_dict = dict()
    for row in csv_reader:
        target_dict[row[0]] = float(row[1])

print(target_dict)

to_process_file = "/process/directory/pident90"
size = 154
output_folder = "/output/directory/align95"
for i in range(1, size + 1):
    process_filename = to_process_file + "/BLAST_" + str(i) + ".txt"
    output_filename = output_folder + "/BLAST_" + str(i) + ".txt"
    seen = set()  # lines already written, so repeat occurrences are skipped
    with open(output_filename, "w") as output_file:
        with open(process_filename) as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=' ')
            for row in csv_reader:
                to_write = " ".join(row) + "\n"
                # Keep rows whose alignment length (column 4) is >= 95% of the
                # subject's sequence length, and skip lines seen before.
                if float(row[3]) >= 0.95 * target_dict[row[1]] and to_write not in seen:  # how to refine this?
                    output_file.write(to_write)
                    seen.add(to_write)
BLASTX Python BLAST

sort | uniq -u

Thanks! Just to be clear, does this mean I will need to remove my previous filtering mechanism (the set 'seen')? Also, how do I run this on a large number of files at once, since uniq -u only seems to work on individual files (and running that many individual commands wouldn't be ideal)?
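In Python terms, here is a minimal sketch of the same uniq -u idea, folded into the per-file loop from the question so that all 154 files are handled in one run (the paths, the 0.95 threshold, and the target_dict lookup are copied from the script above; it replaces the seen set entirely): count every filtered line in a first pass, then write only the lines seen exactly once in a second pass.

import csv
from collections import Counter

# Build the sequence-length lookup, as in the question's script.
with open('/my/directory/seqlength2.txt') as f:
    target_dict = {row[0]: float(row[1]) for row in csv.reader(f, delimiter=' ')}

to_process_file = "/process/directory/pident90"
output_folder = "/output/directory/align95"
size = 154

for i in range(1, size + 1):
    process_filename = to_process_file + "/BLAST_" + str(i) + ".txt"
    output_filename = output_folder + "/BLAST_" + str(i) + ".txt"

    # Pass 1: count every line that passes the alignment-length filter.
    counts = Counter()
    with open(process_filename) as csv_file:
        for row in csv.reader(csv_file, delimiter=' '):
            if float(row[3]) >= 0.95 * target_dict[row[1]]:
                counts[" ".join(row) + "\n"] += 1

    # Pass 2: write only the lines that occurred exactly once (uniq -u).
    with open(process_filename) as csv_file, open(output_filename, "w") as output_file:
        for row in csv.reader(csv_file, delimiter=' '):
            line = " ".join(row) + "\n"
            if counts.get(line) == 1:
                output_file.write(line)

The seen set only skips repeat occurrences; counting first is what lets you drop the first occurrence as well.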

Alban Nabla ▴ 30 · 20 months ago

If I understood the question right, I would suggest using pandas instead of csv. I assume you have a folder full of hit-table .txt files (and for some reason without headers, based on your example?). Pandas will easily let you remove duplicates or build frequency tables to achieve what you requested.

Very simplified example, to give an idea:

import pandas as pd

titles = ['query acc.ver', 'subject acc.ver', '% identity', 'alignment length', 'mismatches', 'gap opens', 'q. start', 'q. end', 's. start', 's. end', 'evalue', 'bit score', '% positive']
df = pd.read_csv('BlastEx.txt', sep='\t', names=titles)

# To drop duplicates (by default this keeps the first occurrence of each subject accession)
df = df.drop_duplicates(subset='subject acc.ver')

# As an alternative to the above, to build a frequency table and keep only the
# subject accessions that occur exactly once
freq = pd.crosstab(index=df['subject acc.ver'], columns='count')
desired_list = freq[freq['count'] < 2]
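
Building on that: drop_duplicates also accepts keep=False, which drops every copy of a duplicated row instead of keeping the first, so the "only lines that were unique to begin with" behaviour from the question becomes a one-liner. A minimal sketch, assuming the same file and column names as the example above:

import pandas as pd

# Same hypothetical input file and headers as the example above.
titles = ['query acc.ver', 'subject acc.ver', '% identity', 'alignment length',
          'mismatches', 'gap opens', 'q. start', 'q. end', 's. start', 's. end',
          'evalue', 'bit score', '% positive']
df = pd.read_csv('BlastEx.txt', sep='\t', names=titles)

# keep=False removes ALL rows whose subject acc.ver occurs more than once,
# leaving only the rows that had a single occurrence to begin with.
unique_only = df.drop_duplicates(subset='subject acc.ver', keep=False)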
