Dear Friends, Hi
I have used several programs (mentioned here) for finding potential ORFs and assessing coding ability in some of my hit-less transcripts after performing BLAST.
Interestingly (or perhaps due to bad luck), there was no overlap between the results of those programs.
I have heard that most of these ORF finders are based on a Markov model that is trained on the full data set, and if we run it on just a small set of sequences, it will not be trained properly and the false positive ORF predictions will be high.
1- Is this really the reason there is no overlap between the results?
2- Why does the Markov model depend on the input size/dataset?
3- Doesn't it analyse each sequence separately?
~ Thank you in advance
Hi and thanks.
Imagine that we have only one transcript (or string of nucleotide sequence) and we want to check if it has the potential to code any protein (even theoretically).
Do we need to add a bunch of transcripts to it to receive a more accurate answer?
It is really bizarre!
Hi Farbod: As long as you have a DNA sequence you can translate it into a protein (in all 6 frames if you want). I doubt there is any theoretical method that is going to give you a "confidence prediction" (setting aside similarity searches/modeling, since we have gone over those already in other threads) that the peptide(s) you see is going to be actually present in your fish.
Exactly. When we were assembling transcriptomes, we would translate each putative transcript in all reading frames, pick the largest protein-coding ORF and BLAST it against related organisms. (We actually pooled the transcripts and reciprocally BLAST-ed them against a related organism's database so we could be more confident.)
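For concreteness, here is a minimal sketch of that pipeline (minus the BLAST step) in Python, assuming Biopython is available; the function names are my own, and it takes the longest stop-free stretch in any frame rather than requiring an ATG start - exactly the kind of definitional choice that makes different ORF finders disagree:

from Bio.Seq import Seq

def six_frame_orfs(transcript, min_aa=30):
    # Yield candidate protein fragments from all six reading frames,
    # using the standard genetic code (Biopython's default table).
    seq = Seq(transcript)
    for strand in (seq, seq.reverse_complement()):
        for frame in range(3):
            sub = strand[frame:]
            sub = sub[:len(sub) - len(sub) % 3]   # trim partial codon
            protein = str(sub.translate())        # '*' marks stop codons
            for fragment in protein.split("*"):   # stop-free stretches
                if len(fragment) >= min_aa:
                    yield fragment

def longest_orf(transcript):
    # Longest candidate across all six frames ('' if there is none).
    return max(six_frame_orfs(transcript, min_aa=0), key=len, default="")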
OK,
I am working on BLAST-LESS transcripts.
What do you mean by BLAST-LESS transcripts?
I mean I performed the BLAST and those transcripts showed no hit (hit-less).
Try a BLASTX against a relevant protein database.
I have done it against NCBI nr
It works better if your database has more relevant sequences and not every single sequence in the known universe :)
I have done it before, without much luck.
Do you have any thoughts on why you don't see results?
Yes, I have assumed that (1) some of them are assembly/sequencing errors, and (2) maybe some of them represent novel genes.
I intend to trap the second group using PCR.
To know whether they are worth PCRing, I am starting by checking whether they are coding.
Please correct me if I am missing something.
Hi genomax2,
Could you please recommend a good software for "translate it into a protein"?
I guess one of the best is Transdecoder.
And I want to know which part of the Markov model formula produces this restriction.
The Viterbi algorithm is at the heart of hidden Markov models, which for many bioinformatics applications means profile hidden Markov models, where the "profile" is built from a multiple sequence alignment so that we can get frequencies at each position of your sequence. The fewer example sequences you have for training, the less representative that profile is of the variance that is potentially present in the data you want to query.
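For reference, the Viterbi recursion itself is short; here is a generic textbook sketch in Python (illustrative only, not HMMER's implementation; the coding/non-coding states and all probabilities below are invented for the example):

import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    # Most likely hidden-state path for an observation sequence, computed
    # in log space to avoid underflow. Assumes all probabilities are nonzero.
    best = {s: (math.log(start_p[s] * emit_p[s][observations[0]]), [s])
            for s in states}
    for obs in observations[1:]:
        best = {s: max(((lp + math.log(trans_p[p][s] * emit_p[s][obs]), path + [s])
                        for p, (lp, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])

# Toy example: a 'coding' state emitting GC-rich bases vs. an AT-rich 'noncoding' state.
states = ("coding", "noncoding")
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {"coding": {"coding": 0.9, "noncoding": 0.1},
           "noncoding": {"coding": 0.1, "noncoding": 0.9}}
emit_p = {"coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
          "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}
log_prob, path = viterbi("GCGCATAT", states, start_p, trans_p, emit_p)

Profile HMMs like HMMER's layer match/insert/delete states on top of this same recursion.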
Algorithms like HMMER use Laplace smoothing and add one to the denominator of that frequency calculation, so what you'll get when using a single sequence as training data is a 0.5 frequency at each position that identically matches your training sequence. Though it can be done this way and sometimes produces surprisingly accurate results, you're almost always better off using BLAST-like methods unless your question involves querying for highly degenerate sequences.
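To see why training-set size matters, here is a toy illustration (my own numbers, using the simplified "add one to the denominator" rule described above, not HMMER's actual prior machinery):

ALPHABET = "ACGT"

def column_frequencies(column):
    # Smoothed per-base frequency for one alignment column, where `column`
    # is the string of bases the training sequences have at that position.
    n = len(column)  # number of training sequences
    return {b: column.count(b) / (n + 1) for b in ALPHABET}

print(column_frequencies("A"))
# one training sequence: the observed base gets 1/(1+1) = 0.5, the rest 0.0
print(column_frequencies("AAAAAAAGGT"))
# ten training sequences: A = 7/11, G = 2/11, T = 1/11 - real variation shows up

With a single sequence the "profile" says almost nothing beyond the sequence itself, which is why false positive rates climb when the training set is small.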
Dear Steven Lakin, Hi and thank you for your complete and informative answer.
I have found this page about Hidden Markov Models; would you please tell me:
1- Do programs such as Transdecoder and EMBOSS getorf use a Markov model or a hidden Markov model?
2- On the wiki page I mentioned above, which part of the formula depends on the "frequencies at each position of the sequence"? Or should I ask the same question about the Viterbi algorithm?
Take Care
EMBOSS' getorf can do this for you.
Dear Ram, Hi
Do you know of any paper comparing ORF-finder software?
No I don't, sorry.
I do have a question for you - why do we need probabilistic models for ORF prediction? Is the genetic code different for your species? Creating sets of 3 starting at seq[0], seq[1], seq[2], revcomp(seq)[0], revcomp(seq)[1] and revcomp(seq)[2], then translating them and finding the longest protein sounds like a pretty straightforward computation to me - am I missing something here?
@Farbod is trying to reduce the number of experiments that need to be done (as best as I can tell).
At some point (bio)informatic options cease to provide useful hypotheses and one has to go back to the experimental bench to test/discredit hypotheses at hand. @Farbod seems to be having a hard time reconciling with that fact.
Dear genomax, Hi
I cannot run PCR for 200 transcripts right now, so I need to choose some of them wisely. I am starting by checking the coding ability of these 200 strings of nucleotides.
And your hypothesis about "having a hard time reconciling with that fact" is not true; I just do not intend to spend money for nothing.
Thank you anyway.
Hi Farbod: The following will sound unpleasant, but there is no other way to say this. You are at a point where you need to go ahead and choose as many PCRs as you can afford and start some experiments. You should quickly get an answer to your question: is it worth going forward with the rest?
I suspect there is not much useful left on the informatics end (after all the work you have put in) to help you narrow the selection down to candidates guaranteed to succeed.
Makes sense. As an aside, is there any reason one would need an HMM to find ORFs?
Hi,
Unfortunately I am not familiar with this "seq[0],seq[1],seq[2],revcomp(seq)[0],revcomp(seq)[1] and revcomp(seq)[2]" approach.
What is an approach to finding ORFs that you are familiar with? Algorithm-wise, that is?