How to classify large-scale ORF data
9.3 years ago
Eva_Maria ▴ 190

Hi

I have all possible ORF regions of a bacterial species (about 50 mb). I want to annotate which proteins these ORFs correspond to. Is there a tool that can do this other than BLAST?

Assembly gene sequence • 1.8k views

For full-length gene/protein homology search, BLAST is the gold standard, or at least many consider it so. Alternatively, you can annotate protein domains (e.g., Pfam) using HMMER or InterProScan. From these annotations you may be able to find other proteins that are likely homologous by comparing their protein domain composition. But I do not know whether this is what you want to do. What is your reason for avoiding BLAST? A better solution can probably be found if you describe your problem more precisely.


Actually, I have about 284,517 ORFs, so it is not possible to analyse them online.


I also want to classify these proteins as hypothetical or not.
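One possible way to make the "hypothetical or not" call, once similarity hits are available, is sketched below. It assumes tab-separated hits with the columns `qseqid sseqid pident evalue bitscore stitle` (what BLAST+ writes with `-outfmt "6 qseqid sseqid pident evalue bitscore stitle"`). The keyword list, the E-value cutoff and the three labels are illustrative assumptions, not an established standard.

```python
# Keywords that mark a hit description as uninformative (assumed list).
UNINFORMATIVE = (
    "hypothetical protein",
    "uncharacterized protein",
    "unknown function",
)


def classify_orfs(orf_ids, tsv_lines, max_evalue=1e-5):
    """Label each ORF as 'annotated', 'hypothetical' or 'no_hit'.

    An ORF is 'annotated' if at least one significant hit has an
    informative description; the best-scoring such description is kept.
    """
    informative = {}  # qseqid -> (bitscore, stitle) of best informative hit
    any_hit = set()   # ORFs with at least one significant hit of any kind
    for line in tsv_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        qseqid, _sseqid, _pident, evalue, bitscore, stitle = fields[:6]
        if float(evalue) > max_evalue:
            continue
        any_hit.add(qseqid)
        if any(key in stitle.lower() for key in UNINFORMATIVE):
            continue
        score = float(bitscore)
        if qseqid not in informative or score > informative[qseqid][0]:
            informative[qseqid] = (score, stitle)

    labels = {}
    for orf in orf_ids:
        if orf in informative:
            labels[orf] = ("annotated", informative[orf][1])
        elif orf in any_hit:
            labels[orf] = ("hypothetical", None)
        else:
            labels[orf] = ("no_hit", None)
    return labels
```

ORFs labelled `no_hit` are not necessarily real genes; many of them may simply be non-coding ORFs.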


Why not run BLAST locally at your facility/institute? In that case you could also create your own BLAST database containing only bacterial species, which holds fewer sequences than the full NCBI nr database. With such a restricted database you should definitely be able to run BLAST locally, shouldn't you?
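As a rough sketch of what a local run could look like with the BLAST+ command-line tools, the helper below only assembles the `makeblastdb` and `blastp` command lines and runs them if the programs are installed. The file names (`bacteria.faa`, `orfs.faa`, `bacteria_db`) are hypothetical placeholders, and the E-value and thread count are assumptions to adjust.

```python
import shutil
import subprocess


def makeblastdb_cmd(fasta, db_name):
    """argv for building a protein BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", db_name]


def blastp_cmd(query, db_name, out_tsv, threads=8, evalue="1e-5"):
    """argv for a tabular (outfmt 6) blastp search against that database."""
    return [
        "blastp", "-query", query, "-db", db_name,
        "-out", out_tsv, "-outfmt", "6",
        "-evalue", evalue, "-num_threads", str(threads),
    ]


def run_if_installed(cmd):
    """Run cmd only when its executable is on PATH; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names; this only prints the commands.
    for cmd in (
        makeblastdb_cmd("bacteria.faa", "bacteria_db"),
        blastp_cmd("orfs.faa", "bacteria_db", "orfs_vs_bacteria.tsv"),
    ):
        print(" ".join(cmd))
```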


BLASTP of 300k ORFs against nr, split over 128 cores (16 BLAST jobs with 8 threads each), would take about one week. You could speed this up tremendously by reducing the number of query sequences through pre-clustering at a high percent identity, say 85-95%. Alternatively, you could try a faster BLAST-like program such as USEARCH or DIAMOND (which claims a 20,000-fold speed-up over BLAST). Another way to speed things up is to pick a smaller reference database such as refseq_protein or UniRef50/90.
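To illustrate the query-side part of this, here is a minimal Python sketch that collapses exactly identical sequences and then deals the remaining records into chunks, one per BLAST job. Note the assumption: exact-duplicate removal is only a crude stand-in for real clustering at 85-95% identity, which needs a dedicated clustering tool. The function names and the `.faa` naming scheme are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def collapse_identical(records):
    """Keep the first record for each distinct sequence (exact matches only)."""
    seen = set()
    unique = []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique


def split_round_robin(records, n_chunks):
    """Deal records into n_chunks lists whose sizes differ by at most one."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


def write_chunks(chunks, prefix):
    """Write each non-empty chunk to <prefix>_<index>.faa; return the paths."""
    paths = []
    for i, chunk in enumerate(chunks):
        if not chunk:
            continue
        path = f"{prefix}_{i:03d}.faa"
        with open(path, "w") as handle:
            for header, seq in chunk:
                handle.write(f">{header}\n{seq}\n")
        paths.append(path)
    return paths
```

With 16 chunks, each file could be given to one 8-thread job, matching the 16 x 8 layout mentioned above.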

9.3 years ago
Michael 55k

The standard approach is to run a bacterial gene predictor, e.g. Glimmer, and then analyze only the ORFs that are predicted to be coding.
