I'm trying to figure out the theoretical chances of variants in two sequences coinciding (in the same position) by chance.
Right now I'm just thinking about the simple case of two aligned sequences of length n (say, 200bp), with a known SNP rate of 0.001 per bp. It's interesting because it seems like a variant of the birthday paradox, but applying the analogy to the case of nucleotides and SNPs is less straightforward than I expected. I also could be wrong about the parallel.
First I'm just concentrating on figuring out the chances of there being a shared SNP at all, regardless of whether the variant base is the same. It seems I might be able to calculate the probability of there not being a shared SNP pretty easily. Looking at any pair of aligned single nucleotides, the chance of both being SNPs should be 0.001 * 0.001 = 0.000001, assuming the two sequences vary independently. The chance of them not both being SNPs is then 0.999999. So am I able to then say the chance of there not being a single shared SNP among all n nucleotides is 0.999999^n, and the chance of at least one shared SNP is 1 - 0.999999^n?
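To sanity-check the arithmetic, here is a minimal Python sketch of that calculation. It assumes every position is independent and has the same SNP rate in both sequences; the function name `prob_shared_snp` is just a label made up for illustration.

```python
import math

def prob_shared_snp(n, p):
    """Probability of at least one position where both sequences carry a SNP.

    Assumes independent positions and the same per-base SNP rate p in
    both sequences (the simplifying assumptions of the question).
    """
    # Chance that a given aligned pair of bases are both SNPs: p * p.
    # log1p/expm1 keep precision when p * p is tiny.
    log_no_shared = n * math.log1p(-p * p)  # log of (1 - p^2)^n
    return -math.expm1(log_no_shared)       # 1 - (1 - p^2)^n

n = 200    # sequence length in bp
p = 0.001  # SNP rate per bp
p_shared = prob_shared_snp(n, p)
print(f"P(no shared SNP)           = {1 - p_shared:.6f}")
print(f"P(at least one shared SNP) = {p_shared:.6f}")
```

For these values the result is roughly 0.0002, which is very close to n * p^2 because p^2 is so small.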
Edit: I should make clear that I know this is loaded with assumptions that simply aren't true in reality, such as evenly distributed SNPs, unrelatedness of the individuals, etc. Which is why the usefulness of even calculating it is up for debate, but I'm trying to get a sense of the mathematical relationship, all things being equal, between SNP frequency, sequence length, and coincidental SNPs. This is, of course, the null hypothesis, whereas the alternative hypothesis is that the shared SNPs are due to homology.
Note also that the human reference genome itself carries the rare allele at some SNP positions. As a result, at these loci the probability that any two unrelated individuals will have the same non-reference base is very high (since the reference base is the rare allele, and the individuals simply have the common allele).
It only becomes a variant of the birthday paradox if you are asking whether, in a pool of sequences, any two of them share a SNP.
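A rough sketch of that pooled version, under the same independence assumptions as the two-sequence case: for a pool of m aligned sequences, a position is "shared" if at least two of the m sequences carry a SNP there. The function name and the pool sizes below are hypothetical, chosen only to illustrate the idea.

```python
def prob_pool_shares_snp(m, n, p):
    """Probability that some position has a SNP in at least two of m sequences.

    Assumes independent sequences and positions, each base being a SNP
    with probability p.
    """
    # Per position: P(no sequence has a SNP) and P(exactly one does).
    none = (1 - p) ** m
    exactly_one = m * p * (1 - p) ** (m - 1)
    q = 1 - none - exactly_one  # P(two or more sequences vary at this position)
    # At least one such position among n independent positions.
    return 1 - (1 - q) ** n

for m in (2, 10, 50):
    print(m, prob_pool_shares_snp(m, n=200, p=0.001))
```

With m = 2 this reduces to the two-sequence formula 1 - (1 - p^2)^n, and the probability grows quickly with m because the number of possible pairs grows as m(m-1)/2, which is the birthday-paradox effect.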