3.0 years ago
rb77
Hello,
I have to BLAST multiple protein sequences from a given species (in a multi-FASTA file) against the human protein database. The goal is to find the closest homolog for each protein sequence.
Is there a way to automate this, i.e. run a blastp query for each protein sequence against the whole human protein DB and then grab the top hit for each query? Thank you, I would appreciate any advice on this.
I would blast the whole multifasta file to the DB and grep afterwards, otherwise you will create a substantial amount of "overhead", loading the DB into memory each time etc ...
When I try to BLAST the whole multi-FASTA file against the DB, it says:
"Your total query length is greater than allowed on the BLAST webserver. You can either reduce the size to 100,000 or less and try again or run stand-alone <@STANDALONE_DOC@> or our <@STANDALONE_DOC_CLOUD@>."
Also, I need the top hit for each protein sequence in the FASTA file, so I'm not sure whether blasting the whole multi-FASTA file will work.
It sounds like you are running this remotely on the NCBI web server. Try splitting your multi-FASTA file into smaller pieces and submitting those. If you have thousands of sequences, though, the public BLAST resource is not meant to support that kind of workload.
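In case it helps, here is a minimal sketch of the splitting step in Python. It assumes the 100,000 limit quoted in the error message counts residues across all queries in one submission; the function names `read_fasta` and `split_fasta` are made up for this illustration and are not part of any BLAST tooling.

```python
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from multi-FASTA text."""
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(text: str, max_residues: int = 100_000) -> List[str]:
    """Group records into chunks whose total sequence length stays
    at or below max_residues (assumed limit, taken from the error message).
    A single record longer than the limit still gets its own chunk."""
    chunks, current, current_len = [], [], 0
    for header, seq in read_fasta(text):
        # Close the current chunk if adding this record would exceed the limit.
        if current and current_len + len(seq) > max_residues:
            chunks.append("\n".join(current) + "\n")
            current, current_len = [], 0
        current.append(f">{header}\n{seq}")
        current_len += len(seq)
    if current:
        chunks.append("\n".join(current) + "\n")
    return chunks
```

Each returned string can be written to its own file and submitted separately. Sequences are written on a single line per record, which BLAST accepts.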
Although it is not advisable, you could keep only 1 hit per query. Ideally keep more: NCBI recommends 5, since the first hit is not guaranteed to be the best.
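To pull out the best hits per query afterwards, something like the sketch below could work. It assumes the search was run with stand-alone BLAST+ and tabular output (`-outfmt 6`), whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. The function name `top_hits` and the choice to rank by bit score are my own assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def top_hits(tabular: str, n: int = 1) -> Dict[str, List[Tuple[str, float, float]]]:
    """Return up to n hits per query as (sseqid, evalue, bitscore),
    ranked by descending bit score.

    Assumes default -outfmt 6 column order (evalue is column 11,
    bitscore is column 12)."""
    hits = defaultdict(list)
    for line in tabular.splitlines():
        # Skip blank lines and the comment lines that -outfmt 7 would add.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        hits[qseqid].append((sseqid, evalue, bitscore))
    return {
        query: sorted(rows, key=lambda row: row[2], reverse=True)[:n]
        for query, rows in hits.items()
    }
```

Calling `top_hits(open("hits.tsv").read(), n=5)` (with `hits.tsv` a hypothetical output file) would keep five candidates per query, so the first one can be checked against the others instead of being trusted blindly.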
Furthermore, a simple BLAST is insufficient to establish homology. There are dedicated tools for this, some of which are based on blast.
I would recommend a literature search.