I presume you are trying to BLAST a set of genome sequences against themselves, and find apparent paralogs?
In that case you first need to run blastn locally, NCBI gives you loads of documentation on this. If you want to use Biopython to do the processing then have the blast executable spit out an xml file (-outfmt= 5
).
You should read the Biopython tuotrial section on BLAST parsing to get an idea of what each blast record is representing. But to get results on the top hit for each record you could do something like
from Bio.Blast import NCBIXML
recs = NCBIXML.parse(open("path_to_your_blast.xml"))
def score_top_hit(rec):
""" """
if rec.alignments:
top_hit = rec.alignments[0].hsps[0]
score = float(top_hit.identities)/ top_hit.align_length
else:
score = 0.0
return score
scores = [score_top_hit(r) for r in recs]
There is almost certainly a more memory/time efficient way of getting the number of sequences above a given cut than this, but it'll work
cuts = [i/100.0 for i in range(0,110, 10)] #hacky way of making a range of floats - must be nicer way!
zip(cuts, [ sum([c > s for s in scores]) for c in cuts])
EDIT to cover some of armahmud3's comments below
Hi armahmud3. I gather this is a homework question. It's great that your teacher encourages you to seek help form others, and Biostar can certainly help you learn bioinformatics. But I don't think anyone is served by providing with a "copy paste" a solution to your problem. Instead, I'll give you some pointers about how to attack the problem.
From your problem statement: you want to BLAST a set of gene sequences against each other, collect some information form the resulting output, and summarise it in a .csv file. Although Biopython does have "wrapper" scripts to help you run BLAST locally, you don't need to use them. Instead you can simply use the BLAST executables form the commandline. The NCBI's documentation of the BLAST programmes is very good. You'll want to download your sequences, use makeblastdb
to create a database and blastn
to run your search.
The code I gave you above will count up the number of sequences where the "top hit" is above or below each threshold. You should also pay around a bit with the BLAST parser, to get a feel of what each record represents. In an interactive session, check out:
>>> recs = NCBIXML.parse(open("path_to_your_blast.xml"))
>>> r = recs.nex()
>>> dir(r)
>>> r.alignments
Finally you want to spit the results out as .csv
file. Read up on python input and output handling to see how to do this.
I remember that you asked similar question earlier. Could you please provide some examples input file ?
I am in a serious problem regarding this python script. Don't get solution yet and this is why I am asking repeatedly for help in this forum. Input file should be a single genome file in genbank format like NC_011000.gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Burkholderia_cenocepacia_J2315_uid57953/ I am a beginner in programing language. I try to solve this problem by reading biopython tutorial , cookbook. unfortunately I did not write the script yet.
Why you want to start task from .gbk files ? Why not .fasta files ? Just wanna reconfirm that you were saying about BLASTN ?