Question

allele frequencies, genotype

0

8.0 years ago

prostoesh ▴ 20

Can anyone please explain what exactly the genotype information is? I need to compare allele frequencies (or something similar based on genotypes) between two simulation scenarios whose output is written in a very specific format generated by the msms tool.

Here is an example of a file:

ms 5 3 -t 5 -N 5000000 -I 3 5 0 0 -es 1.5 1 0.5 -es 7.5 1 0.5  [3.2rc Build:162]
0xebfc42a7c6a366af

//
segsites: 7
positions: 0.04375 0.04618 0.28698 0.68964 0.75697 0.77997 0.87296
0000100
0000100
1111011
0000100
0000100

//
segsites: 25
positions: 0.03118 0.08248 0.13261 0.24075 0.34263 0.40965 0.44703 0.49119 0.55406 0.55828 0.55906 0.58059 0.58536 0.63079 0.63707 0.65367 0.67934 0.72050 0.75345 0.78975 0.81020 0.83922 0.88021 0.94746 0.95751
0000100100010001001111001
0000000100111000001110001
0110000100010100001110001
0000000100111000001110001
1001011011000010110000110

//
segsites: 6
positions: 0.07350 0.15611 0.37691 0.80368 0.98965 0.99393
100000
001011
001111
001011
010000

genome SNP sequence • 1.6k views

Updated 5.8 years ago by Biostar 20 • written 8.0 years ago by prostoesh ▴ 20

score 1 · Answer 1 · 2016-12-09

1

8.0 years ago

Vitis ★ 2.6k

Looks like the first line tells the number of segregating sites, which are DNA sites in the genome showing a variant (SNP/indel) in the simulated population. The second line looks like allele frequencies at each site, and the remaining lines are genotypes (0 as A and 1 as a, or some other coding scheme). I wonder why the allele frequencies are not calculated directly from the genotypes; they must have been corrected by some model in your simulation. I strongly suggest you read the msms manual and paper before looking at the results.

0

Thank you for your reply! I really appreciate the help, because I've been trying to understand the msms program for some time now.

According to the manual, the second line isn't allele frequencies: it gives the positions of the segregating sites, scaled to the (0,1) interval along the simulated sequence, one position per segsite.

Still, can you help me a little more and explain how I get an allele frequency from a genotype (which is actually a haplotype) like this? The manual says:

"the haplotypes of each of the sampled chromosomes are given, each as a string of zeros and ones. The ancestral state is coded with a zero, and the mutant, or derived state, indicated with a one"

What's the procedure to get the allele frequency? I'm quite confused.

8.0 years ago by prostoesh ▴ 20

1

Allele frequency is calculated from the alleles in a column (one column is one segregating site, one row is one sampled haplotype). For example, for the first site (out of 7 sites) in the first simulation, the frequency of allele 1 (derived) is 20% (1 out of 5) and the frequency of allele 0 (ancestral) is 80% (4 out of 5). Haplotype frequency is a bit more complicated, because you have to look at all sites together. Again using the first simulation as an example, there are two unique haplotypes in this data set: 0000100 and 1111011. Their haplotype frequencies are 80% (4 out of 5) and 20% (1 out of 5), respectively.
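
If you want to do this counting in a script, here is a minimal Python sketch. It is not part of msms; the function names (`parse_ms_replicates`, `derived_allele_frequencies`, `haplotype_frequencies`) are made up for illustration, and it assumes the layout shown in the question: each replicate starts with a `//` line, and the haplotypes are the lines consisting only of 0s and 1s.

```python
from collections import Counter


def parse_ms_replicates(text):
    """Split ms-style output into replicates (hypothetical helper).

    Returns a list of replicates, each a list of haplotype strings.
    Assumes every replicate begins with a '//' line, as in the example file.
    """
    replicates = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "//":
            current = []
            replicates.append(current)
        elif current is not None and line and set(line) <= {"0", "1"}:
            # Haplotype rows contain only 0 (ancestral) and 1 (derived).
            current.append(line)
    return replicates


def derived_allele_frequencies(haplotypes):
    """Frequency of allele 1 at each site, i.e. per column."""
    n = len(haplotypes)
    return [sum(int(c) for c in column) / n for column in zip(*haplotypes)]


def haplotype_frequencies(haplotypes):
    """Frequency of each distinct haplotype, i.e. per unique row."""
    n = len(haplotypes)
    return {hap: count / n for hap, count in Counter(haplotypes).items()}


# First replicate from the question, copied verbatim.
example = """\
//
segsites: 7
positions: 0.04375 0.04618 0.28698 0.68964 0.75697 0.77997 0.87296
0000100
0000100
1111011
0000100
0000100
"""

reps = parse_ms_replicates(example)
print(derived_allele_frequencies(reps[0]))
# [0.2, 0.2, 0.2, 0.2, 0.8, 0.2, 0.2]
print(haplotype_frequencies(reps[0]))
# {'0000100': 0.8, '1111011': 0.2}
```

To compare two scenarios you could run this over each output file and compare the per-replicate frequency lists; how best to summarise them across replicates depends on your question, so I have left that part out.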