Can anyone please explain to me what exactly the genotype information is? I need to compare allele frequencies (or something like that about genotypes) from 2 simulation scenarios, which are written in a very specific format generated by the msms tool.
Here is an example of a file:
ms 5 3 -t 5 -N 5000000 -I 3 5 0 0 -es 1.5 1 0.5 -es 7.5 1 0.5 [3.2rc Build:162]
0xebfc42a7c6a366af
//
segsites: 7
positions: 0.04375 0.04618 0.28698 0.68964 0.75697 0.77997 0.87296
0000100
0000100
1111011
0000100
0000100
//
segsites: 25
positions: 0.03118 0.08248 0.13261 0.24075 0.34263 0.40965 0.44703 0.49119 0.55406 0.55828 0.55906 0.58059 0.58536 0.63079 0.63707 0.65367 0.67934 0.72050 0.75345 0.78975 0.81020 0.83922 0.88021 0.94746 0.95751
0000100100010001001111001
0000000100111000001110001
0110000100010100001110001
0000000100111000001110001
1001011011000010110000110
//
segsites: 6
positions: 0.07350 0.15611 0.37691 0.80368 0.98965 0.99393
100000
001011
001111
001011
010000
Thank you for your reply! I really appreciate the help, because I've been trying to understand the msms program for some time now.
According to the manual, the second line (positions:) isn't exactly allele frequencies: it gives the positions of the segregating sites on the (0,1) interval, one position per segsite, relative to the simulated sequence.
Could you help me a little more, though, and explain how I get an allele frequency from a genotype (which is actually a haplotype) like this? The manual says
"the haplotypes of each of the sampled chromosomes are given, each as a string of zeros and ones. The ancestral state is coded with a zero, and the mutant, or derived state, indicated with a one"
What's the procedure to get the allele frequency? I'm quite confused.
Allele frequency is calculated per column of the haplotype block (one column is one segregating site, one row is one sampled chromosome). For example, for the first site (out of 7 sites) in the first simulation, the frequency of allele 1 (derived) is 20% (1 out of 5) and the frequency of allele 0 (ancestral) is 80% (4 out of 5). Haplotype frequency is a bit more complicated, because you have to look at all sites together, i.e. at whole rows. Again using the first simulation as an example, there are two unique haplotypes in this data set: 0000100 and 1111011. Their haplotype frequencies are 80% (4 out of 5) and 20% (1 out of 5), respectively.
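If it helps, here is a rough Python sketch of the counting described above. It is not part of msms. The function names (parse_ms, derived_allele_frequencies, haplotype_frequencies) are made up for illustration, and the parser assumes the file looks exactly like the example you posted: a header, then blocks that each start with a "//" line.

```python
from collections import Counter

def parse_ms(text):
    """Split ms/msms-style output into replicates (assumes the layout shown in the question)."""
    replicates = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "//":
            # A "//" line starts a new simulated replicate.
            current = {"positions": [], "haplotypes": []}
            replicates.append(current)
        elif current is None or not line:
            continue  # header lines (command, seed) or blank lines
        elif line.startswith("positions:"):
            current["positions"] = [float(x) for x in line.split()[1:]]
        elif set(line) <= {"0", "1"}:
            current["haplotypes"].append(line)  # one sampled chromosome
        # anything else (e.g. "segsites: 7") is ignored
    return replicates

def derived_allele_frequencies(haplotypes):
    """Frequency of allele 1 at each site, i.e. per column."""
    n = len(haplotypes)
    return [column.count("1") / n for column in zip(*haplotypes)]

def haplotype_frequencies(haplotypes):
    """Frequency of each distinct haplotype, i.e. per unique row."""
    n = len(haplotypes)
    return {hap: count / n for hap, count in Counter(haplotypes).items()}

# First replicate from the example file in the question.
example = """ms 5 3 -t 5 -N 5000000 -I 3 5 0 0 -es 1.5 1 0.5 -es 7.5 1 0.5 [3.2rc Build:162]
0xebfc42a7c6a366af
//
segsites: 7
positions: 0.04375 0.04618 0.28698 0.68964 0.75697 0.77997 0.87296
0000100
0000100
1111011
0000100
0000100
"""

replicates = parse_ms(example)
first = replicates[0]
allele_freqs = derived_allele_frequencies(first["haplotypes"])
hap_freqs = haplotype_frequencies(first["haplotypes"])
print(allele_freqs)  # [0.2, 0.2, 0.2, 0.2, 0.8, 0.2, 0.2]
print(hap_freqs)     # {'0000100': 0.8, '1111011': 0.2}
```

To compare your two scenarios, you could run the same functions on each output file and look at the frequencies replicate by replicate.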
Thanks a lot! That really helped!