Question: How Can I get a list of sequences in fasta format from the Blast output hits?
Question: How Can I get a list of sequences in fasta format from the Blast output hits?
if you're using the blast+ version : use the blastdbcmd utility (comes with the blast installation) to retrieve sequences from the blastDBs , either use '-entry <id>' for a single sequence or '-entry_batch <file>' for a whole list of IDs . You'll still have to first get the list of IDs you want to retrieve as suggested by st.ph.n
Another method:
-outfmt 5
in BLAST+ suite command line. This will give you a xml file as output.Use this little python script to parse rapidly data.
import re
output = open('file.out','w')
n = 0
print >> output, 'protein'+'\t'+'hit_def'+'\t'+'hit_acc'+'\t'+'e-value'
with open('blast_output.xml','r') as xml:
for line in xml:
if re.search('<Iteration_query-def>', line) != None:
n = n + 1
line = line.strip()
line = line.rstrip()
line = line.strip('<Iteration_query-def>')
line = line.rstrip('</')
query_def = line
if re.search('No hits found', line) != None:
line = line.strip()
line = line.rstrip()
line = line.strip('<Iteration_message>')
line = line.rstrip('</')
print n
print >> output, query_def+'\t'+line
if re.search('<Hit_def>', line) != None:
line = line.strip()
line = line.rstrip()
line = line.strip('<Hit_def>')
line = line.rstrip('</')
hit_def = line
if re.search('<Hit_accession>', line) != None:
line = line.strip()
line = line.rstrip()
line = line.strip('<Hit_accession>')
line = line.rstrip('</')
hit_acc = line
if re.search('<Hsp_evalue>', line) != None:
line = line.strip()
line = line.rstrip()
line = line.strip('<Hsp_evalue>')
line = line.rstrip('</')
e_val = line
print n
print >> output, query_def+'\t'+hit_def+'\t'+hit_acc+'\t'+e_val
output.close()
This step will give you a pretty file with proteins, hit name, acession number of hit and e-value as columns. This file can be also parsed easily.
Thanks!
I created something like that and it seems to work (for me:) ).
Best
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Use a format such as tab-delimited
-outfmt 6
(makes it easier to parse IMO), and retrieve the query hit ids you want, an go back to your query FASTA and parse out the sequences that you want based on headers.Thank you for a suggestion!!!