I have been given a task to compare the all the protein sequences of a strain of campylobacter with a strain of E.coli. I would like to do this locally using Biopython and the inbuilt Blast tools. However, I'm stuck on how to program this and what tools I should be using. If anybody could point me in the right direction, I would be thankful!
@satsurae It seems that you have found your answer! :) For the benefice of the BioStar users, would you consider accepting one of the proposed answers? Cheers.
Hi, if you're having trouble, maybe it'd be wiser to do the simplest analysis -- just do an all-vs-all blastp directly using the command-line-interface to blast. For example use E.coli as the subject and the other species as the query and a command like:
Using the -m 8 option, you can parse the tab-delimited blast file with few lines of code. This assumes you have the protein fasta files in hand -- the are probably available from ncbi or your favorite bacteria genome repository.
I agree, this seems like the best way to accomplish your goal - don't overcomplicate with BioPython where it simply isn't necessary. By the way, you can grab the sequences you need from UniProt: search for "organism:$tax_id AND keyword:"Complete proteome"" where $tax_id is the taxonomy identifier of the strain you're interested in (so E. coli K12 is 83333 and C jejuni is 197). Though this method does rely on the curation levels of UniProt, which may not be so good for organisms a little off the beaten track.
Small stuff: if possible move tabulated blast output into an SQL database. You will save time needed to write the code iterating over text columns, say for finding best reciprocal hits (BRH).
Hi guys, thanks for all the help. However, after using the above formatdb line, I get a fatal error saying that the database was not found when using the blastall code. I think there is a problem in the indexing of the fasta to a 'database'. I wouldn't have thought I would need to index both database and query files. Is there another way to index the fasta to a db? Cheers!
The BLAST section of the BioPython module is not terribly well documented. The relevant section is here.
Before starting you'll need to create a local BLAST database. To my knowledge there is no way to do this directly through BioPython (but you could use the Subprocess module to automate the command line if you really wanted too). The documentation for the BLAST tool is on the NCBI website here.
You'll obviously need to download the whole genome sequences. I suggest using the NCBI FTP site, I can never seem to find the right link in the normal web-portal.
Once you have all of the relevant downloads and databases created you'll simply need to run the BLAST query in a loop that processes all of the data. Something like this should work (I don't have the required data to test this but it should get you 95% of the way there.)
from Bio.Blast.Applications import NcbiblastxCommandline
from Bio.Blast import NCBIXML
from Bio import SeqIO
import subprocess
SOURCE_FASTA = '/path/to/source/seq.fasta'
DATABASE = 'databsename' #should be in the path but YMMV
with open(SOURCE_FASTA) as inhandle:
for seq in SeqIO.parse(handle, 'fasta'):
with open('scratch.fasta', 'w') as outhandle:
#write a scratch file
SeqIO.write(seq, outhandle, 'fasta')
#create the commandline string
cline = NcbiblastxCommandline(query='scratch.fasta',
db=DATABASE, evalue=0.001, outfmt=5, out="scratch.xml")
#actually run BLAST
return_code = subprocess.call(str(cline))
if return_code == 0:
#if it was successful then process it
with open('scratch.xml') as xmlhandle:
blast_record = NCBIXML.read(xmlhandle)
do_something_with_results(blast_record)
Hope that helps.
ADD COMMENT
• link
updated 13 months ago by
Ram
44k
•
written 14.8 years ago by
Will
4.6k
If the genomes are public, someone may have done this analysis for you, or there may be a web-based tool to do the job.
For example, here is a genome comparison tool at the JCVI/CMR.
The JGI Integrated Microbial Genome system is also very good. It allows you to select multiple genomes for various comparative analyses. I believe they may even have all versus all data buried away in their FTP archive somewhere. However, take some time to get used to the site - it has quite a steep learning curve.
I just want to make a suggestion, if you are interested in visually comparing the two genomes, I highly recommend the Artemis and ACT (Artemis Comparison Tool) from the Sanger institute.
If you have the full genome, then you can easily visualize all the reading frames. With ACT, you use the two genome sequences (or one genome and a fasta file of genes/proteins or even two gene/protein fasta files) as well as a "comparison file", which is a blast output file.
It is very useful for annotation and curation!
ADD COMMENT
• link
updated 13 months ago by
Ram
44k
•
written 14.8 years ago by
Nicojo
★
1.1k
@satsurae It seems that you have found your answer! :) For the benefice of the BioStar users, would you consider accepting one of the proposed answers? Cheers.