Why does a Markov model depend on the size of the dataset?
Farbod ★ 3.4k · 8.1 years ago

Dear Friends, Hi

I have used several programs (mentioned here) to find potential ORFs and assess coding ability in some of my hit-less transcripts after performing BLAST.

Interestingly (or through bad luck), there was no overlap between the results of those programs.

I have heard that most of these ORF finders are based on Markov models, which are trained on the full data set; if we run one on just a small set of sequences, it will not be trained properly and the false-positive ORF predictions will be high.

1. Is this really the reason for the lack of overlap between the results?

2. Why does a Markov model depend on the input/dataset size?

3. Doesn't it analyse each sequence separately?

~ Thank you in advance

Tags: software, error, gene, sequence, HMM
Ram 44k · 8.1 years ago

Markov models learn from the training data set and apply that "knowledge" to your dataset. Like any statistical model, the power of the test goes up with the sample size. That being said, there is always a lower bound on the sample size - something you cannot go below, because that would render the test meaningless.

The more practice the model has distinguishing actual results from coincidental outcomes, the better it should perform in non-training scenarios. The model you're using should give you a recommended number of training data points for efficient analysis - that is, the lowest false discovery rate at optimal sensitivity.

I know this sounds vague - I hope someone can explain this in a better, more grounded fashion.
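
As a toy illustration of the sample-size effect (a minimal sketch with made-up random sequences and a first-order Markov chain over nucleotides, not any particular ORF finder's model): transition probabilities estimated from a couple of sequences are noticeably noisier than those estimated from many, and that instability is what inflates false positives.

    # Toy demo: estimate first-order Markov transition probabilities from
    # training sets of different sizes. The sequences are randomly generated,
    # so the "true" probability of every transition is 0.25.
    import random
    from collections import Counter

    random.seed(0)

    def transition_probs(seqs):
        counts, totals = Counter(), Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                counts[(a, b)] += 1
                totals[a] += 1
        return {pair: n / totals[pair[0]] for pair, n in counts.items()}

    pool = ["".join(random.choices("ACGT", k=300)) for _ in range(1000)]
    small = transition_probs(pool[:2])    # trained on 2 sequences
    large = transition_probs(pool)        # trained on all 1000
    for pair in [("A", "T"), ("G", "C")]:
        print(pair, round(small.get(pair, 0.0), 3), round(large.get(pair, 0.0), 3))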


Hi and thanks.

Imagine that we have only one transcript (a single string of nucleotide sequence) and we want to check whether it has the potential to code for any protein (even theoretically).

Do we need to add a bunch of transcripts to it to receive a more accurate answer?

It is really bizarre!


Hi Farbod: As long as you have a DNA sequence, you can translate it into a protein (in all 6 frames if you want). I doubt there is any theoretical method that will give you a "confidence prediction" (setting aside similarity searches/modeling, since we have gone over those already in other threads) that the peptide(s) you see will actually be present in your fish.
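
For example, here is a minimal sketch of six-frame translation using Biopython (assuming Biopython is installed; the transcript below is a made-up example):

    # Translate a transcript in all six reading frames.
    from Bio.Seq import Seq

    transcript = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

    for label, strand in (("+", transcript), ("-", transcript.reverse_complement())):
        for offset in range(3):
            frame = strand[offset:]
            frame = frame[: len(frame) - len(frame) % 3]  # whole codons only
            print(f"{label}{offset + 1}: {frame.translate()}")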


Exactly. When we were assembling transcriptomes, we would translate each putative transcript in all reading frames, pick the largest protein-coding ORF, and BLAST it against related organisms. (We actually pooled the transcripts and reciprocally BLASTed them against a related-organism database so we could be more confident.)


OK,

I am working on BLAST-LESS transcripts.


What do you mean by BLAST-LESS transcripts?


I mean I performed the BLAST and those transcripts showed no hits (hit-less).


Try a BLASTX against a relevant protein database.


I have done it against NCBI nr.


It works better if your database has more relevant sequences, not every single sequence in the known universe :)


I have done it before.

Not much luck.


Do you have any thoughts on why you don't see results?


Yes, I assume that (1) some of them are assembly/sequencing errors, and (2) maybe some of them represent novel genes.

I intend to capture the second group using PCR.

To know whether they are worth PCRing, I am starting by checking whether they are coding.

Please correct me if I am missing something.


Hi genomax2,

Please recommend good software for "translating it into a protein".

I guess one of the best is TransDecoder.

And I want to know: which part of the Markov model formula produces this restriction?


The Viterbi algorithm is at the heart of hidden Markov models; many bioinformatics applications involve profile hidden Markov models, where the "profile" is built from a multiple sequence alignment so that we can get residue frequencies at each position of your sequence. The fewer example sequences you have for training, the less representative that profile is of the variance potentially present in the data you want to query.
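
To make the dynamic programming concrete, here is a textbook Viterbi decoder over a toy two-state HMM (coding vs. non-coding); this is a generic sketch with made-up parameters, not the implementation inside any particular ORF finder:

    # Textbook Viterbi: find the most likely hidden-state path for an
    # observed sequence under a toy two-state HMM.
    def viterbi(obs, states, start_p, trans_p, emit_p):
        # V[t][s] = probability of the best path ending in state s at time t
        V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            V.append({})
            back.append({})
            for s in states:
                prob, prev = max(
                    (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                    for p in states
                )
                V[t][s] = prob
                back[t][s] = prev
        # Trace back from the best final state
        last = max(V[-1], key=V[-1].get)
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return list(reversed(path))

    states = ("coding", "noncoding")
    start_p = {"coding": 0.5, "noncoding": 0.5}
    trans_p = {"coding": {"coding": 0.9, "noncoding": 0.1},
               "noncoding": {"coding": 0.1, "noncoding": 0.9}}
    emit_p = {"coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
              "noncoding": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}
    print(viterbi("ATGCGC", states, start_p, trans_p, emit_p))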

Algorithms like HMMER use Laplace smoothing, adding one to the denominator of that frequency calculation, so what you'll get when using a single sequence as training data is a frequency of 0.5 at each position that identically matches your training sequence. Though it can be done this way and sometimes produces surprisingly accurate results, you're almost always better off using BLAST-like methods unless your question involves querying for highly degenerate sequences.
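
To see the pseudocount arithmetic, a toy calculation (the exact smoothing scheme varies by tool, so treat this as an illustration of the idea rather than HMMER's actual prior machinery):

    # Plus-one smoothing on an alignment column: with a single training
    # sequence, a residue that matches it gets frequency 1 / (1 + 1) = 0.5.
    def smoothed_frequency(column, residue):
        return column.count(residue) / (len(column) + 1)

    training = ["ATG"]                       # "alignment" of one sequence
    first_column = [seq[0] for seq in training]
    print(smoothed_frequency(first_column, "A"))  # 0.5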


Dear Steven Lakin, Hi and thank you for your complete and informative answer.

I have found this page about hidden Markov models; would you please tell me:

1. Do software packages such as TransDecoder and EMBOSS getorf use a Markov model or a hidden Markov model?

2. On the wiki page I mentioned above, which part of the formula depends on the "frequencies at each position of the sequence"? Or should I ask the same question about the Viterbi algorithm?

Take Care


EMBOSS' getorf can do this for you.


Dear Ram, Hi

Do you know of any paper comparing ORF-finder software?


No, I don't, sorry.

I do have a question for you - why do we need probabilistic models for ORF prediction? Is the genetic code different for your species? Creating sets of 3 starting at seq[0], seq[1], seq[2], revcomp(seq)[0], revcomp(seq)[1], and revcomp(seq)[2], then translating them and finding the longest protein sounds like a pretty straightforward computation to me - am I missing something here?
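
Something like this rough sketch, using Biopython only for the translation table (the function name is illustrative, not from any tool):

    # Translate all six frames, split on stop codons, and keep the longest
    # peptide that starts with a methionine.
    from Bio.Seq import Seq

    def longest_orf_protein(seq):
        best = ""
        for strand in (seq, seq.reverse_complement()):
            for offset in range(3):                  # the three frame starts
                frame = strand[offset:]
                frame = frame[: len(frame) - len(frame) % 3]
                for segment in str(frame.translate()).split("*"):
                    start = segment.find("M")
                    if start != -1 and len(segment) - start > len(best):
                        best = segment[start:]
        return best

    print(longest_orf_protein(Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")))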


@Farbod is trying to reduce the number of experiments that need to be done (as best as I can tell).

At some point (bio)informatic options cease to provide useful hypotheses and one has to go back to the experimental bench to test/discredit hypotheses at hand. @Farbod seems to be having a hard time reconciling with that fact.


Dear genomax, Hi

I cannot run PCR on 200 transcripts right now, so I need to choose some of them wisely. I am starting by checking the coding ability of these 200 strings of nucleotides.

And your hypothesis about my "having a hard time reconciling with that fact" is not true; I just do not intend to spend money for nothing.

Thank you anyway.


Hi Farbod: The following will sound unpleasant, but there is no other way to say it. You are at the point where you need to go ahead, choose as many PCRs as you can afford, and start some experiments. You should quickly get an answer to your question: is it worth going forward with the rest?

I suspect there is not much left on the informatics end (after all the work you have put in) that can help you narrow the selection to one with guaranteed success.


Makes sense. As an aside, is there any reason one would need an HMM to find ORFs?


Hi,

Unfortunately I am not familiar with this "seq[0],seq[1],seq[2],revcomp(seq)[0],revcomp(seq)[1] and revcomp(seq)[2]" approach.


What is an approach to finding ORFs that you are familiar with? Algorithm-wise, that is?

