9.3 years ago
Eva_Maria
Hi
I have all possible ORF regions of a bacterial species (about 50 Mb) and want to annotate which protein each ORF corresponds to. Is there a tool available for this other than BLAST?
For full-length gene/protein homology search, BLAST is the gold standard, or at least many consider it so. Alternatively, you can annotate protein domains (e.g., Pfam) using HMMER or InterProScan. From these annotations you may be able to find proteins that are likely homologous by comparing their domain composition. But I do not know whether this is what you want to do. What is your reason against BLAST? A better solution can probably be found if you describe your problem more precisely.
Actually, I have about 284,517 ORFs, so it is not possible to analyse them online.
I also want to classify these proteins as hypothetical or not.
Why not run BLAST locally at your facility/institute? In that case you could also create your own BLAST database containing only bacterial sequences, which has far fewer sequences than the full NCBI nr database. With such a restricted database you should definitely be able to run BLAST locally.
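As a rough sketch of what a local run could look like, the snippet below only assembles the two BLAST+ command lines (`makeblastdb` to build a protein database, `blastp` to search it) and prints them; it does not execute anything. The file names `bacteria.faa`, `orfs.faa`, `bacteria_db` and `hits.tsv`, the E-value cutoff and the thread count are placeholders I made up for illustration, so adjust them to your data.

```python
import shlex


def makeblastdb_cmd(fasta, db_name):
    """Command line to build a protein BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", db_name]


def blastp_cmd(query, db_name, out, evalue="1e-5", threads=8):
    """Command line for a blastp search with tabular output (outfmt 6)."""
    return [
        "blastp",
        "-query", query,
        "-db", db_name,
        "-out", out,
        "-outfmt", "6",              # tab-separated hit table
        "-evalue", evalue,           # assumed cutoff, tune as needed
        "-num_threads", str(threads),
    ]


if __name__ == "__main__":
    # Hypothetical file names; print the commands so they can be inspected
    # or pasted into a shell / job script.
    print(shlex.join(makeblastdb_cmd("bacteria.faa", "bacteria_db")))
    print(shlex.join(blastp_cmd("orfs.faa", "bacteria_db", "hits.tsv")))
```

To actually run them you could pass each list to `subprocess.run(cmd, check=True)`, assuming BLAST+ is installed and on your PATH.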
A BLASTP search of 300k ORFs against nr, split over 128 cores (16 BLAST jobs with 8 threads each), would take about one week. You could speed this up tremendously by reducing the number of query sequences through pre-clustering at a high percent identity, say 85-95%. Alternatively, you could try a faster BLAST-like program such as USEARCH or DIAMOND (which claims a 20,000x speed-up over BLAST). Another way to speed things up is to pick a smaller reference database such as refseq_protein or UniRef50/90.
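Splitting the query set into chunks for parallel jobs could be done along these lines. This is a minimal sketch, not a tested pipeline: the helper names `read_fasta` and `split_records` are my own, and the round-robin assignment is just one simple way to get chunks of roughly equal size.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(records, n_chunks):
    """Distribute records round-robin into n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


if __name__ == "__main__":
    # Tiny made-up example; real input would be the ORF protein FASTA file.
    example = [
        ">orf1", "MKT", "AYI",
        ">orf2", "MSL",
        ">orf3", "MQR",
        ">orf4", "MAA",
        ">orf5", "MGG",
    ]
    for n, chunk in enumerate(split_records(read_fasta(example), 2)):
        print(n, [header for header, _ in chunk])
```

With `n_chunks=16` each chunk could be written to its own FASTA file and searched as a separate 8-thread job, matching the 16 x 8 layout above.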