Could someone please tell me how probable the following event is?
*Two duplicated regions (800 base pairs long) are localized very closely and should have similar sequences (i.e. 80%-90% identity) but not totally the same. However they show same sequences because of sequencing or assembly error.*
The best you can do in such cases is a "back-of-the-envelope" estimation.
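Say the likelihood of any base being affected by a sequencing error is 1/100; then the likelihood that a chosen base ends up changed to one specific value just by sequencing error would be roughly 1/400 (1/100 times about 1/4 for landing on a given base; 1/300 if you count only the three alternative bases). At 80% identity, 160 of the 800 bases differ (20% of 800), so the likelihood that all 160 get changed to match would be p = (1/400)^160. Even at 90% identity, 80 bases differ, giving p = (1/400)^80.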
This, of course, is for a single region. If you had many regions, you would need to apply a multiple testing correction.
Regardless, suffice it to say that the chance of this happening is effectively zero.
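To put rough numbers on this estimate, here is a minimal sketch that works in log10 so the tiny probabilities do not underflow. It assumes every site errs independently, uses the 1/400 per-base figure from above, and counts the differing bases as (1 - identity) x 800; the function names are just illustrative.

```python
import math

def differing_bases(region_len: int, identity: float) -> int:
    """Number of positions at which two copies of the region differ."""
    return round(region_len * (1 - identity))

def log10_all_match(n_sites: int, p_site: float) -> float:
    """log10 of the probability that n_sites independent sites
    all get changed to one specific value by error."""
    return n_sites * math.log10(p_site)

REGION_LEN = 800   # from the question
P_SITE = 1 / 400   # assumed chance of one specific erroneous base at a site

for identity in (0.80, 0.90):
    n_diff = differing_bases(REGION_LEN, identity)
    log_p = log10_all_match(n_diff, P_SITE)
    print(f"identity {identity:.0%}: {n_diff} differing bases, log10(p) = {log_p:.1f}")
```

Under these assumptions the exponent comes out around -416 for 80% identity and around -208 for 90% identity, which is why a multiple testing correction cannot rescue it.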
Cannot agree enough with Istvan. To expound a bit, at least on the sequencing side of things:
Let's also bear in mind that if this is DNA sequencing, you'll likely have more than one read covering any base in the region. In that case, you're not only requiring a sequencing error that makes the two regions more alike, you're also requiring that the same error occurs in every single read covering that particular base. If we assume a base has 30 reads covering it, then before multiple testing correction, the chance that the same sequencing error occurs in every read at that one base is something around (1/400)^30, or about 10^-78.
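The coverage argument can be sketched the same way. This assumes reads are independent, the same 1/400 per-base figure, 30x coverage, and (as an illustration only) 160 differing bases for an 800 bp region at 80% identity; the function names are hypothetical.

```python
import math

def log10_same_error_in_every_read(p_site: float, coverage: int) -> float:
    """log10 of the chance that every read covering one base
    carries the same specific sequencing error."""
    return coverage * math.log10(p_site)

def log10_whole_region(p_site: float, coverage: int, n_diff: int) -> float:
    """Same quantity, required at all n_diff differing bases at once."""
    return n_diff * log10_same_error_in_every_read(p_site, coverage)

P_SITE = 1 / 400   # assumed per-read chance of one specific wrong base
COVERAGE = 30      # assumed read depth
N_DIFF = 160       # 20% of 800 bp, i.e. 80% identity

print(f"one base:     log10(p) = {log10_same_error_in_every_read(P_SITE, COVERAGE):.1f}")
print(f"whole region: log10(p) = {log10_whole_region(P_SITE, COVERAGE, N_DIFF):.1f}")
```

A single base already comes out near 10^-78; requiring it at every differing base pushes the exponent into the thousands.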
If the sequences are also distinguished by indels, the chance that an error in a particular region makes the duplicated sequences look more alike can change dramatically, since you're no longer restricted to 3 alternatives: an indel can add or remove any number of bases. Getting an indel that makes two regions indistinguishable is less likely if the region is stable w.r.t. indels, or more likely if the region is prone to indels, e.g. STRs.
Your statement has the implicit assumption that the only sequencing and assembly errors that are occurring make the two regions look more alike. There's no reason to assume that w.r.t. sequencing, there aren't errors occurring that make the sequences look less alike as well. In fact, given 80% identity, it's more likely that a sequencing error will make a segment of the duplicated regions that were previously identical look different.
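A quick way to see the asymmetry, as a rough sketch: assume a single substitution error hits a uniformly random position in one copy and picks one of the three other bases uniformly. At a position where the copies agree, any error makes them differ; at a position where they differ, only one of the three possible errors makes them agree.

```python
def error_effect(identity: float) -> tuple[float, float, float]:
    """Return (p_more_alike, p_less_alike, p_unchanged) for one random
    substitution error in one copy, given the fraction of identical sites."""
    p_less_alike = identity                # error lands on a matching site
    p_more_alike = (1 - identity) / 3      # mismatching site, error hits the other copy's base
    p_unchanged = (1 - identity) * 2 / 3   # mismatching site, still a mismatch afterwards
    return p_more_alike, p_less_alike, p_unchanged

more, less, same = error_effect(0.80)
print(f"more alike: {more:.3f}, less alike: {less:.3f}, unchanged: {same:.3f}")
```

With 80% identity, an error is about twelve times more likely to pull the copies apart than to push them together under this simple model.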
Finally, I'm no assembly expert, but sequencing errors can usually be filtered from an assembly using a kmer filter--counting the number of times a particular kmer occurs should alert you to rare kmers that likely represent errors.
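As a toy illustration of the k-mer idea (not how any particular assembler implements it), the sketch below counts k-mers across reads and flags those seen fewer than a threshold number of times. The reads, `k`, and `min_count` are made up for the example.

```python
from collections import Counter

def count_kmers(reads: list[str], k: int) -> Counter:
    """Count every k-mer occurring in the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def rare_kmers(counts: Counter, min_count: int) -> set[str]:
    """k-mers seen fewer than min_count times; candidates for sequencing errors."""
    return {kmer for kmer, n in counts.items() if n < min_count}

# Five error-free reads plus one read with a single substitution (A -> T at position 5).
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
counts = count_kmers(reads, k=4)
suspect = rare_kmers(counts, min_count=2)
print(sorted(suspect))  # the k-mers that overlap the erroneous base
```

In real data the threshold would depend on coverage, and rare k-mers can also come from genuinely low-abundance sequence, so this is only a first-pass filter.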
All this is not to say that the two duplicated regions won't be very difficult to distinguish in assembly. I don't know if 80% identity is enough to distinguish two regions, but my intuition is that they will be difficult to distinguish in assembly primarily because of their inherent structural similarity, and only negligibly because of sequencing errors.
Bold is just there in case I'm too verbose, to draw out the main points I'm trying to make.
It depends on whether we are really talking about an assembly error here. Tandem duplications are quite common, and it is usually tricky for genome assemblers to assemble them correctly. So yes, in that case it could be a probable event.
When you say "they show same sequences", do you really mean because of a sequencing error, or maybe because of an assembly error?
Oh, sorry about that; I should have said that it might also be due to assembly error. So if this is taken into account, how likely do you expect this event to be? Thanks!