Question

ORF Finder vs. BLASTX

2

14.4 years ago

User 3822 ▴ 60

This is kind of a stupid question. Suppose I have a contig sequence and I want to know what kind of protein it might encode (if it encodes one at all). I run it through NCBI ORF Finder and then choose to run BLAST and COGnitor on the longest reading frame. I get a few hits, plus conserved domains for a protein. Next I run the same contig through NCBI blastx and get hits for a couple of proteins. But the protein I found through the longest reading frame has a lower score in blastx.

How do I choose which protein it might be?

orf blast • 13k views

updated 13.7 years ago by Ketil 4.2k • written 14.4 years ago by User 3822 ▴ 60

Be careful with the phrase "lower score". A lower bit score indicates a "worse" hit, whereas a lower e-value indicates a "better" hit.
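
To make the opposite directions concrete, here is a minimal Python sketch of the standard Karlin-Altschul relation between bit score and E-value, E = m · n · 2^(-S'). It is a simplification: real BLAST uses effective (edge-corrected) query and database lengths, and the function name and the example lengths below are made up for illustration.

```python
def evalue_from_bit_score(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits with at least this bit score.

    Simplified: uses raw lengths, whereas BLAST uses effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search space: a 300-residue query against 10 million residues.
m, n = 300, 10_000_000
for bits in (30.0, 40.0, 50.0):
    # Higher bit score -> smaller E-value -> "better" hit.
    print(f"bit score {bits:5.1f} -> E-value {evalue_from_bit_score(bits, m, n):.3g}")
```

Each extra bit halves the E-value for a fixed search space, which is why the two numbers move in opposite directions.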

14.4 years ago by Neilfws 49k

Yes, I meant lower bit score.

14.4 years ago by User 3822 ▴ 60

Is the contig genomic or a transcript? If it is a transcript, is it spliced?

14.4 years ago by Aleksandr Levchuk 3.2k

I've found an interesting discussion here. I wonder what frame-shift penalty value(s) are typically used for BLASTX.

updated 5.7 years ago by Ram 45k • written 14.2 years ago by Woa ★ 2.9k

score 11 · Answer 1 · 2010-11-27

The key thing here is that both methods are giving you useful information: just not the same information.

Let's start with the blastx result. BLASTX is a crude, quick way to see if a nucleotide sequence has protein-coding potential. It simply translates in all 6 frames and compares the resulting sequences to a protein database. If your contig contains intron sequence or sequence errors then you will not see the "true", mature protein sequence, but you will get some idea that there is one in there somewhere.
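
As a rough illustration of the "translates in all 6 frames" step, here is a minimal pure-Python sketch. It assumes the standard genetic code and an unambiguous A/C/G/T sequence; the function names are made up for this example, and it covers only the translation part, not the database search that BLASTX then runs on each frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq: str) -> dict:
    """Return the translations keyed by frame label: +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

An intron or a single-base indel in the contig shows up here as the real protein being split across two of these frames, which is why BLASTX hits can be fragmentary.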

ORF finders look for sequences that resemble true open reading frames: a start codon, a stop codon, in-frame sequence in between, and perhaps other features. Traditionally (especially in prokaryotic genomics), short ORFs are discarded and the longest is chosen as the "best". However, it's important to remember that you are dealing with predictions, not experimental data. In addition, the quality of the contig sequence will have a large bearing on the quality of the predicted ORFs.
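
The "discard short ORFs, keep the longest" heuristic can be sketched in a few lines of Python. This is not how NCBI ORF Finder is implemented; it is a toy version that assumes ATG as the only start codon, the three standard stop codons, and a made-up `min_codons` cutoff.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_strand(seq: str, min_codons: int):
    """Yield (start, end) of ATG...stop ORFs; end is exclusive and includes the stop."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None  # look for the next ORF in this frame


def longest_orf(seq: str, min_codons: int = 30):
    """Return (strand, orf_sequence) for the longest ORF, or None if none qualifies."""
    seq = seq.upper()
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for start, end in orfs_in_strand(s, min_codons):
            if best is None or end - start > len(best[1]):
                best = (strand, s[start:end])
    return best


contig = "CC" + "ATG" + "GCC" * 5 + "TAA" + "G"
print(longest_orf(contig, min_codons=3))
```

Note that a single sequencing error that creates or destroys a stop codon changes which ORF is "longest", so the result is only as good as the contig.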

In summary, BLASTX is a "quick and dirty" test and an ORF finder should provide a "better" prediction of true ORFs. You would expect the BLAST scores to differ, since you are looking at slightly different sequences. But always remember that you are looking at computational predictions. Experimental validation (e.g. transcript sequencing) is the only way to determine a "true" ORF.

score 2 · Answer 2 · 2010-11-27

14.4 years ago

Aleksandr Levchuk 3.2k

Maybe you can also try to do gene prediction by using MAKER. It does several genome annotation steps among which is "producing ab-initio gene predictions". It uses many other software packages (SNAP, Augustus, GeneMark, ...) so it's a bit laborious to install.

score 2 · Answer 3 · 2010-11-27

It all depends on the quality of your query (sequencing errors can introduce stop codons) and the level of similarity to various proteins in blastx. Whether you sequenced a genomic fragment or cDNA, you can still have a retained intron, a large ncRNA, or a chimeric clone, to name the most obvious cases.

IMO in most cases, assuming strong blastx hits (>60% similarity, ca. 40 aa), you will be better off starting with blastx. With vague, short hits to DNA of dubious quality, possibly also containing repeats, making the call "is it a gene?" is problematic. A long, non-repetitive ORF, even without strong blastx hits, is then a good hint.

Tip: do not restrict yourself to blastx. Tblastn with ESTs from closely related species, or a genomic alignment, may resolve some tricky exons or less conserved parts of the protein.

score 1 · Answer 4 · 2010-11-29

Well, the obvious explanation is that the longest ORF is not the right one. One possible cause is a sequencing error that introduces a frame shift. BLASTX isn't too great with frame shifts either, so maybe you can check with another aligner?

When predicting ORFs, I use a dynamic programming algorithm that pieces together "compatible" BLASTX hits and includes evidence such as AUG, stop codon and poly-A tail. I think this is better than just using the longest ORF; I can dig up the graph comparing the two if you're interested.
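
For readers wondering what "piecing together compatible hits" could look like, here is a minimal Python sketch of one generic way to do it: a dynamic-programming chain over hits, similar to weighted interval scheduling. This is not the algorithm described above, whose details are not given here. The `Hit` fields, the definition of "compatible" (same strand and same subject protein assumed, collinear and non-overlapping in both query and subject), and the example numbers are all assumptions, and the AUG, stop-codon and poly-A evidence is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    q_start: int   # start on the contig (nucleotides, half-open interval)
    q_end: int
    s_start: int   # start on the subject protein (residues, half-open interval)
    s_end: int
    frame: int     # reading frame of the hit; may differ between chained hits
    score: float   # e.g. the bit score


def compatible(a: Hit, b: Hit) -> bool:
    """b may follow a if it lies downstream in both the contig and the protein."""
    return a.q_end <= b.q_start and a.s_end <= b.s_start


def best_chain(hits):
    """Return (total_score, chain) for the highest-scoring chain of compatible hits."""
    hits = sorted(hits, key=lambda h: (h.q_end, h.s_end))
    if not hits:
        return 0.0, []
    best = [h.score for h in hits]   # best chain score ending at hit i
    prev = [None] * len(hits)        # back-pointers for traceback
    for i, h in enumerate(hits):
        for j in range(i):
            if compatible(hits[j], h) and best[j] + h.score > best[i]:
                best[i] = best[j] + h.score
                prev[i] = j
    i = max(range(len(hits)), key=best.__getitem__)
    total, chain = best[i], []
    while i is not None:
        chain.append(hits[i])
        i = prev[i]
    return total, chain[::-1]


# Hypothetical hits: two collinear pieces in different frames (a frame shift)
# and one stronger single hit that conflicts with both.
hits = [
    Hit(0, 90, 0, 30, frame=1, score=50),
    Hit(91, 181, 30, 60, frame=2, score=40),
    Hit(30, 150, 100, 140, frame=3, score=60),
]
total, chain = best_chain(hits)
print(total, [h.frame for h in chain])
```

Because chained hits are allowed to sit in different frames, a chain like this can span a frame shift that would truncate the longest single ORF.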