Using my very basic probability skills, I will try to explain dsull's comments in a little more detail.
You can pretty easily calculate the probability of the sequence being correct, i.e., no base is wrong, as (probability the base is correct)^(number of bases). So, for example, for a 100 bp sequence with all bases at Q20 (a 1% error probability per base), the probability of the sequence being correct is (0.99)^100 = 0.366, or a 36.6% chance of having no errors.
The probability of a sequence "being wrong" is one minus the probability of the sequence being correct. So, for the same example above of a 100 bp sequence with all bases at Q20, the probability of the sequence being wrong (i.e., containing one or more errors) is 1 - 0.366 = 0.634, or a 63.4% chance of containing at least one error.
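To make the arithmetic above easy to check, here is a minimal Python sketch. The function names are my own, and it assumes that errors at different bases are independent and that every base has the same Phred score, which real reads rarely satisfy exactly.

```python
def phred_to_error_prob(q):
    """Per-base error probability for a Phred score q: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def prob_read_correct(q, length):
    """Probability that all `length` bases are correct (independence assumed)."""
    return (1 - phred_to_error_prob(q)) ** length


def prob_read_wrong(q, length):
    """Probability that the read contains at least one error."""
    return 1 - prob_read_correct(q, length)


# The 100 bp, all-Q20 example from the text
p_correct = prob_read_correct(20, 100)
p_wrong = prob_read_wrong(20, 100)
print(f"{p_correct:.3f} {p_wrong:.3f}")  # prints "0.366 0.634"
```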
Note there is only one way for a sequence to be correct (all bases must be correct), but there are many ways a sequence can be wrong: one base can be wrong, two bases, and so on. The estimate from your question, (0.01)^100, is actually the probability of all bases of the sequence being wrong.
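The "many ways to be wrong" point can be made concrete with the binomial distribution. The sketch below is only an illustration under the same assumptions as before (independent errors, identical 1% error probability per base); `prob_k_errors` is a name I made up for it.

```python
from math import comb


def prob_k_errors(n, k, p):
    """Binomial probability of exactly k errors among n bases."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 100, 0.01  # 100 bp read, all bases Q20

p_none = prob_k_errors(n, 0, p)  # (0.99)^100, the only way to be correct
p_all = prob_k_errors(n, n, p)   # (0.01)^100, every single base wrong
# "Wrong" covers every case from 1 error up to n errors
p_at_least_one = sum(prob_k_errors(n, k, p) for k in range(1, n + 1))

print(f"no errors:        {p_none:.3f}")
print(f"at least one:     {p_at_least_one:.3f}")
print(f"all {n} wrong:    {p_all:.1e}")
```

Summing over all the ways of being wrong gives the same value as 1 - (0.99)^100, while (0.01)^100 is just one vanishingly small term of that sum.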
No, let's assume that the sequencer has a 1% chance of error. So if 100 nucleotides are sequenced, it is expected that one of them will have an error.
You'd have to do 1 - (0.99^100) to calculate your desired probability, not (0.01)^100
Going by this, it would mean a 250 bp read would have a 1-((0.99)^250) ~ 92% chance of being wrong if all bases in the sequence had a Phred score of 20? Or did you mean to say this is the probability of the sequence being not wrong? I suppose I am misunderstanding something here.
What do you consider "wrong"? My definition of wrong is if a sequence differs from the ground truth even by one base.
So yes, your 250 bp read would have a 92% chance of being wrong (i.e. differing from the true sequence by at least one base).
A 1% error rate is pretty high (it's 1 in 100 bases basically; and now you're giving me a read with 250 bases, so of course it will most likely contain an error).
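As a rough illustration of how quickly this grows with read length, here is a small Python sketch. It assumes independent errors and a flat 1% error probability at every base; `read_error_summary` is a hypothetical helper, not something from any library.

```python
def read_error_summary(length, p=0.01):
    """Expected number of errors and P(at least one error) for a read."""
    expected_errors = length * p
    p_wrong = 1 - (1 - p) ** length
    return expected_errors, p_wrong


# Compare a few read lengths at a 1% per-base error rate (Q20)
for length in (50, 100, 250):
    expected_errors, p_wrong = read_error_summary(length)
    print(f"{length} bp: expect {expected_errors:.1f} errors, "
          f"P(at least one error) = {p_wrong:.3f}")
```

For the 250 bp read this gives an expected 2.5 errors and a probability of about 0.92 of at least one error, matching the figure above.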
I was actually tripping over precisely what @h.mon explained in their answer: "Note there is only one way for a sequence to be correct (all bases must be correct), but there are many ways a sequence can be wrong".
Now your answer makes sense to me!!
A Phred score is for each individual basecall. I don't think it can be extended to a stretch of sequence (just intuition; I'm not a statistician).
^^^ yes, these are usually quoted at the level of an individual base.
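For what it's worth, per-base scores can still be combined into a read-level probability if you are willing to assume the base errors are independent, which is the assumption behind the (0.99)^100 calculation in the answer. A minimal sketch, assuming a FASTQ quality string in the common Phred+33 encoding; the function names and the example quality string are made up for illustration:

```python
from math import prod


def decode_phred33(quality_string):
    """Convert a Phred+33 quality string to a list of integer Phred scores."""
    return [ord(ch) - 33 for ch in quality_string]


def prob_read_has_error(quals):
    """P(at least one wrong base), assuming independent per-base errors."""
    p_all_correct = prod(1 - 10 ** (-q / 10) for q in quals)
    return 1 - p_all_correct


# Toy quality string: 'I' decodes to Q40 and '5' decodes to Q20
quals = decode_phred33("IIII5555")
p_wrong = prob_read_has_error(quals)
print(quals)
print(f"{p_wrong:.4f}")
```

The result is a probability for the whole read rather than a Phred score quoted by the sequencer, so the per-base interpretation of the original scores is unchanged.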
Phred-scaled scores are also calculated for the alignment of a read to a reference though, i.e., the MAPQ score.