I have three sets of multiple samples to compare. They are DNA sequences from three samples, and I need to compare them letter by letter. I was wondering whether there is any way to show their concordance using a statistical method or a package in R. I don't have a good knowledge of statistics and would really appreciate it if anyone could explain how to compare them.
Say I have three samples: sampleA, sampleB and sampleC. I want to compare the sequence concordance for each pair of samples and then for all three together (overall concordance). I just want to get the statistical significance of their concordance.
sampleA: AATGCCTGGAAA
sampleB: AATGCTTGGAAA
sampleC: TTTGCCTGG
Define sequence concordance (identity? conservation?)! To me this looks like a common multiple/global alignment problem, or maybe a question of determining regions under selection, e.g. Ka/Ks. To calculate significance you need a null model, that is, a model that explains what happens just by chance. You need to define that too.
As you see, you need to present your problem in full detail, including where your samples come from and what the biological question really is. Also avoid imposing a potential solution in the problem description ("I need to compare them letter by letter", which implies an end-to-end global alignment; "R-package").
Also, your example is most likely not representative.
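To make the "define concordance" point concrete, here is a minimal sketch of one possible definition: the fraction of positions at which the sequences carry the same letter, compared position by position with no alignment. It is written in Python rather than R, the function name `identity` is made up for illustration, and ignoring the trailing letters of the longer sequences is an assumption, not something the question specifies.

```python
from itertools import combinations

# Example sequences copied from the question.
samples = {
    "sampleA": "AATGCCTGGAAA",
    "sampleB": "AATGCTTGGAAA",
    "sampleC": "TTTGCCTGG",
}

def identity(*seqs):
    """Return (matching columns, compared columns) for two or more sequences.

    Assumption: sequences are compared position by position from the first
    letter. zip() stops at the shortest sequence, so any trailing letters of
    the longer sequences are ignored.
    """
    columns = list(zip(*seqs))
    matches = sum(1 for col in columns if len(set(col)) == 1)
    return matches, len(columns)

# Pairwise concordance for each pair of samples.
for a, b in combinations(samples, 2):
    m, n = identity(samples[a], samples[b])
    print(f"{a} vs {b}: {m}/{n} identical positions")

# Overall concordance: columns where all three samples agree.
m, n = identity(*samples.values())
print(f"all three: {m}/{n} identical positions")
```

This only gives a descriptive number. It says nothing about significance, and a different definition (for example, identity after alignment with gaps) can give different values for the same sequences.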
Most likely there is already a well established framework in genetics that covers it.
Regarding the statistics of alignments, maybe this helps: https://www.ncbi.nlm.nih.gov/BLAST/tutorial/
Accordingly, approximate p-values of global alignments are best determined from simulations. Is that what you are aiming at?
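A rough sketch of what such a simulation could look like, in Python rather than R. Everything specific here is an assumption chosen for illustration: the scoring scheme (match +1, mismatch -1, linear gap -1), the null model (shuffling one sequence so that its letter composition is kept but the order is random), and the function names `nw_score` and `shuffle_p_value`. Whether this null model fits depends on the biological question.

```python
import random

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions. Only the score is
    returned, not the alignment itself.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def shuffle_p_value(a, b, n_sim=1000, seed=0):
    """Empirical p-value of the observed score under a shuffling null model.

    Null model (an assumption): b is a random permutation of its own letters.
    The +1 in numerator and denominator keeps the estimate above zero.
    """
    rng = random.Random(seed)
    observed = nw_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(letters)
        if nw_score(a, "".join(letters)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_sim + 1)

score, p = shuffle_p_value("AATGCCTGGAAA", "AATGCTTGGAAA")
print(f"observed score = {score}, empirical p = {p:.4f}")
```

With sequences as short as the ones in the example, the estimate is coarse and depends heavily on the chosen scoring and null model, so treat the output as an illustration of the procedure rather than a result.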
Ok, Thanks.
Why did you delete the original content of that comment? It looked like it was more informative...