Python script for extracting the query and alignment sequence from an all-vs-all BLAST
7.2 years ago
apulunuj ▴ 30

I have some BLAST output in TSV format. The first column is the query_id, the second is the subject_id, and the third column is the alignment. An example of the data format is:

Dictyostelium_gi | 550558531 | gb |   Dictyostelium_gi | 550558531 |gb  YKLADEQWSTTILLIMCQE..


Dictyostelium_gi | 550558531 | gb |   Physarum_gi | 560558531 |gb  DEPPALI-X-WSTTILLIMCQE..

I wrote a script to extract the lines where the query and subject IDs are equal, as in the first line of the example, and also to make a file that contains the query_id and the alignment match from each such line. Here is the script:

! /usr/bin/env python3.5

import sys

blast_out = open('out.tsv','r')

for line in blast_out:

   line = line.split('\t')
   query_id = line[0]
   subject_id = line [1]
   alignment = line [2]
   if 'query_id' == 'subject_id' in line:
      output = line
      for i in output:
         seq_id = output[0]
         seq_seq = output[3]
         blast_output = open('blast_output','w')
         blast_output = ('>' + 'seq_id' + '\n' + 'seq_seq')
       return blast_output

I am having issues generating the output I want: the line from the TSV file whose query and subject IDs are equal, and the file containing the query_id and alignment from that line. What are the issues in my script, please and thank you?

python blast
7.2 years ago
shoujun.gu ▴ 380

Try something like this:

output = []
with open('Your_file', 'r') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        query_id = fields[0]
        subject_id = fields[1]
        alignment = fields[2]
        if query_id == subject_id:
            output.append((query_id, alignment))

out = []
for seq_id, alignment in output:
    blast_output = '>' + seq_id + '\n' + alignment + '\n'
    out.append(blast_output)

with open('blast_output', 'w') as file:
    file.writelines(out)

Also, double-check the return value of line = line.split('\t')

Sometimes it will contain an empty element '' at the beginning, so you need to adjust the list indices accordingly.


I checked that; the list indices seem to be correct, but I'm still not getting any output. I find it very puzzling.


There are some problems in your original code.

Try my posted code instead to see whether it works.


The script runs completely. However, the output file 'blast_output' is empty.


Try adding:

print(output)

to the code, to see whether the list is empty.


I debugged it: output is empty, but everything else shows results when I use the print command.


Then there must be something wrong with the file parsing. Could you paste some lines of your input file? I don't see any tabs in your posted example:

Dictyostelium_gi | 550558531 | gb | Dictyostelium_gi | 550558531 |gb YKLADEQWSTTILLIMCQE..


I corrected some errors in my previous code. However:

Your posted example is not well formatted; I tried to split the lines on tabs and it doesn't work. If that's due to a copy/paste problem, you could try my updated code to see whether it works on your real file.

Besides, if your file is clean, using awk as in Pierre Lindenbaum's answer is much simpler than writing a Python script.


The code and Pierre helped. Thanks a bunch!

7.2 years ago

The following awk script should do the job:

awk -F '\t' '{if($1==$2) printf(">%s\n%s\n",$1,$3);}'

However, in your example the two columns are not strictly identical: there is an extra '|' at the end of the first column...
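If that trailing '|' is really present in the data, the fields can be normalized before comparing them. A Python sketch, with the column contents assumed from the example above:

```python
def canon(field):
    # Drop a trailing '|' delimiter and collapse spacing around '|'.
    parts = [p.strip() for p in field.strip().rstrip("|").split("|")]
    return "|".join(parts)

q = "Dictyostelium_gi | 550558531 | gb |"
s = "Dictyostelium_gi | 550558531 |gb"
print(canon(q) == canon(s))  # True
```

The same normalization could be done in awk with gsub() before the $1 == $2 test.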


Thank you, this worked.
