Question

Working with redundant information on BLAST output

0

Entering edit mode

5.8 years ago

flogin ▴ 280

Hi guys,

Currently I'm working with a lot of BLASTn analysis, I have an output in tabular format (argument -outfmt 6) with more than 170 thousand lines. In that link: https://textuploader.com/11krp has an example with pieces of information that I want: query, subject, subject start, subject end and score (in that order).

As we can see, different queries can be matched with the same subject, in the same or in different positions.

In the next steps, I'm gonna extract those regions of the subject, using the positions of start and end, but if I make that extraction with this type of information, I'm gonna recovery a lot of redundant sequences..

In my point of view, have 4 cases of redundant matches:

1 - Same region of subject = same s_start and same s_end with different scores;

e.g. lines 29, 33, 37 and 43

2 - Almost the same regions of subject 1 = s_start different, s_end equal with different scores;

e.g. lines 26 (s_start = 928719), 18, 30, 34, 38 (s_start = 928718)

3 - Almost the same region of subject 2 = s_start equal, s_end different with different scores

e.g. lines 18, 30, 34, 38 (s_end = 929459) and 44 (s_end = 929456).

4 - case four, different length of the same region = s_tart and s_end differents, but covering the same subject region, different scores.

e.g. lines 17 (s_start = 922442, s_end = 923192), 29, 33, 37, 43 (s_tart = 922444, s_end = 923190)

So... I have a little experience with Python, and wrote the following script:

import csv
# openning file
with open('blast_test.csv') as csv_file:
    subject_dict = {} # dictionary to store original informations
    subject_dict_2 = {} #dictionary to store filtred informations
    csv_reader = csv.reader(csv_file, delimiter=',')
# creating a dictionary with subjects information
    #reading file line by line
    for row in csv_reader:
        print(row)
        #atribuiting each column to one variable, modfying the name of subject
        query,subject_old, subject_new, s_start, s_end, score = row[0],row[1],row[1]+'_'+row[2]+'_'+row[3], row[2], row[3], row[4]
        # inserting subjects in a dictionary
        subject_dict[subject_new] = [subject_old, query, s_start, s_end]
        #
#testing dictionary
for k,v in subject_dict.items():
    print(k,':',v)

making comparisons
for k,v in subject_dict.items():
#    if 


# creating an output
with open('blast_test_filtred.csv', mode='w') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for subject in subject_dict:
        writer.writerow([subject, s_start, s_end, score, query)])

My logical is:

1 - Create a dictionary with all informations, changing the subject name (just to facilitate my understanding of outputs)

2 - Remove the redundant information using the criteria of the four cases cited above;

3 - Write this new information in an output file.

To remove this redundant information, I think in create a threshold of 10 nucleotides up and downstream of each region (start and end), following by comparisons of regions using the subject original name (subject_old), and selecting the region with the best score (in a manner to recovery all different regions)).

Can anyone explain to me how I can perform this step cited above?

Thanks.

blast filter python dictionary • 1.5k views

ADD COMMENT • link 5.7 years ago by flogin ▴ 280

2

Entering edit mode

The easiest way is to use pandas library in python for such tabular data. Once the data is in a dataframe format, it would be easier for you to group, filter and aggregate this data. There are a range of pandas tutorials, blogs and worked examples available that would help you with this.

ADD REPLY • link 5.8 years ago by Sej Modha 5.3k

0

Entering edit mode

Thanks Sej, I'm gonna check this!

ADD REPLY • link 5.8 years ago by flogin ▴ 280

0

Entering edit mode

So.. I work now with pandas and with all columns of BLAST output.

import pandas as pd
header_outfmt6 = ['qseqid','sseqid','pident','length','mismatch','gapopen','qstart','qend','sstart','send','evalue','bitscore']
df = pd.read_csv('blast_brute.txt', sep='\t',header = None,names = header_outfmt6)
df[(df.sseqid==df.sseqid) & (df.sstart != df.sstart) & (df.send != df.send)]

But only the column names are returned....

ADD REPLY • link 5.8 years ago by flogin ▴ 280

0

Entering edit mode

It would be easier to use groupby in your dataframe and then sort by column of interest.

Also see https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html and https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas for comparing column values in pandas.

ADD REPLY • link 5.8 years ago by Sej Modha 5.3k

0

Entering edit mode

I think that groupby maybe don't be applied to my problem, because, following the guide from pydata.org, the filter work in 2 ways: Discard data that belongs to groups with only a few members or Filter out data based on the group sum or mean. And I want to make interactive comparisons between several columns...

Using the method .drop_duplicates() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html), I got to remove the duplicates matches based on sseqid, sstart and ssend:

import pandas as pd
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-in", "--input", help="blast outfmt 6 format",  required=True)
args = parser.parse_args()
read_file = args.input

header_outfmt6 = ['qseqid','sseqid','pident','length','mismatch','gapopen','qstart','qend','sstart','send','evalue','bitscore']
df = pd.read_csv(read_file, sep='\t',header = None,names = header_outfmt6)
df_2 = df.drop_duplicates(subset=['sseqid','sstart','send'])

out_csv = read_file+'.filtred'
df_2.to_csv(out_csv, sep='\t')

But have more 2 steps that I could not execute yet: - make those comparisons in a range of nucleotides (-10 and + 10 of each sstart and send) - Keep the row with major bitscore...

ADD REPLY • link 5.7 years ago by flogin ▴ 280