I have a matrix in which the rows are isolates and the columns are nucleotides at selected sites where homozygous variation has been detected. Is there a way to do an Fst test on it? I can export this matrix into R. I have never done an Fst analysis before.
Thank you!
Update: My data consists of 6 isolates, and for each isolate I have a VCF file listing its variants relative to the reference genome. It looks something like this:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT whatever
Sample1 8139885 . A G 591.03 . AB=0.342857;ABP=18.0245;AC=1;AF=0.25;AN=4;AO=24;CIGAR=1X;DP=70;DPB=3323;DPRA=0;EPP=46.8017;EPPR=94.401;GTI=0;HWE=-0;LEN=1;MEANALT=1;MQM=255;MQMR=255;NS=1;NUMALT=1;ODDS=3.62626;PAIRED=1;PAIREDR=1;PAO=6.95324e-310;PQA=0;PQR=0;PRO=6.95324e-310;QA=920;QR=1770;RO=46;RPP=46.8017;RPPR=94.401;RUN=1;SAP=55.1256;SRP=102.898;TYPE=snp;XAI=0.00803798;XAM=0.0305247;XAS=0.0224867;XRI=0.00860706;XRM=0.0107998;XRS=0.00219274;technology.illumina=1;BVAR GT:DP:RO:QR:AO:QA 0/0/0/1:70:46:1770:24:920
This corresponds to one position where a variant has been found. Each of the 6 files lists the variants present in that isolate, compared to the reference genome. As you can see, it tells me that A and G at that location are present in about a 2 to 1 ratio, since there are 46 observations for A (RO=46) and 24 for G (AO=24), and the algorithm estimates the frequency of the alternate allele G to be 0.25 (AF=0.25).
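To make the two numbers in that record explicit, here is a minimal Python sketch that reads the FORMAT and sample columns of the line above. The function name `allele_summary` is made up for illustration, and the sketch assumes a biallelic site with the RO, AO and GT fields present, as in this record. It computes the fraction of reads supporting G and, separately, the frequency implied by the called genotype 0/0/0/1.

```python
def allele_summary(format_field, sample_field):
    """Return (alt read fraction, genotype-based alt frequency) for one biallelic site.

    Hypothetical helper: assumes the RO, AO and GT keys are present in FORMAT.
    """
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    ro = int(values["RO"])  # reads supporting the reference allele (A here)
    ao = int(values["AO"])  # reads supporting the alternate allele (G here)
    read_fraction = ao / (ro + ao)
    # GT holds one allele index per chromosome copy, e.g. 0/0/0/1 for a tetraploid call
    alleles = values["GT"].replace("|", "/").split("/")
    gt_frequency = alleles.count("1") / len(alleles)
    return read_fraction, gt_frequency


read_fraction, gt_frequency = allele_summary(
    "GT:DP:RO:QR:AO:QA", "0/0/0/1:70:46:1770:24:920"
)
print(round(read_fraction, 3), gt_frequency)  # 0.343 0.25
```

The read fraction (24/70, matching AB=0.342857 in the INFO column) and the genotype-based value (1 of 4 copies, matching AF=0.25) are different quantities, so it is worth deciding which of the two to carry into any downstream calculation.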
That being said, this is an observation for the entire population being sequenced by NGS. If the organism is tetraploid, my conclusion is that all individuals have the G allele in one out of 4 chromatids and A in 3 out of 4 chromatids. There is not much more I can say here, is there? I do not know how many are heterozygous A/G, homozygous A, homozygous G, and so on. I only know the frequency of allele A and the frequency of allele G.
The worked example I usually point to is this one, which shows that to compute Fst you first need to compute expected and observed heterozygosities. But without a few lines of your input file it's hard to suggest anything.
I looked through your example. In my population, I do not have information on genotype frequencies (i.e. how many AA, how many Aa, and how many aa). I only have information on which loci are heterozygous in one population, and then I use that to compare with the heterozygous loci in another population.
I have 6 isolates, but the information that I am able to extract, from NGS of the entire population, is limited. If I see in one population a 50/50 ratio at one locus for 2 different basepairs A and G, I assume that all individuals are heterozygous, with A in one chromatid, and G in the second chromatid.
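One caveat on that assumption, as a small Python sketch with made-up diploid genotypes (the function name and both toy populations are hypothetical, not taken from the data above): a pooled 50/50 ratio of A and G is what you would see if every individual were A/G, but also if half the individuals were A/A and half G/G.

```python
def allele_freq_and_obs_het(genotypes, allele="G"):
    """Return (frequency of `allele`, observed heterozygosity) for a list of genotypes.

    Each genotype is a string with one letter per chromosome copy, e.g. "AG".
    """
    total_alleles = sum(len(g) for g in genotypes)
    freq = sum(g.count(allele) for g in genotypes) / total_alleles
    # An individual counts as heterozygous if it carries more than one distinct allele
    obs_het = sum(1 for g in genotypes if len(set(g)) > 1) / len(genotypes)
    return freq, obs_het


all_het = ["AG"] * 10              # toy population: every individual heterozygous
no_het = ["AA"] * 5 + ["GG"] * 5   # toy population: no heterozygotes at all

print(allele_freq_and_obs_het(all_het))  # (0.5, 1.0)
print(allele_freq_and_obs_het(no_het))   # (0.5, 0.0)
```

Both toy populations give the same allele frequency, so pooled read counts alone cannot distinguish them; the observed heterozygosity has to come from per-individual genotypes.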
How do I proceed from this?
Fst can be understood as a measure of how the heterozygosity of a particular marker behaves across populations, comparing expected and observed values. For that reason you need to work with raw genotypes, and again it's not clear what your raw data looks like. As a side note, be aware that NGS is a technique that favours detection of homozygous over heterozygous sites, which may affect the Fst estimate.
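For orientation, here is a minimal Python sketch of one common textbook form of the statistic, Fst = (H_T - H_S) / H_T, at a single biallelic site. It is a sketch under stated assumptions rather than a recommended pipeline: it uses only expected heterozygosities derived from per-population allele frequencies, it weights all populations equally, and it ignores sample-size corrections. The function names and the example frequencies 0.25 and 0.75 are made up for illustration. Quantities that use observed heterozygosity, such as Fis, would still need per-individual genotypes.

```python
def gene_diversity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_from_frequencies(freqs):
    """Fst = (H_T - H_S) / H_T from per-population allele frequencies.

    Assumes equal population weights and no sample-size correction.
    """
    if len(freqs) < 2:
        raise ValueError("need allele frequencies from at least two populations")
    # H_S: mean expected heterozygosity within populations
    h_s = sum(gene_diversity(p) for p in freqs) / len(freqs)
    # H_T: expected heterozygosity of the pooled (total) population
    p_bar = sum(freqs) / len(freqs)
    h_t = gene_diversity(p_bar)
    if h_t == 0.0:
        return float("nan")  # site is fixed for one allele everywhere; Fst undefined
    return (h_t - h_s) / h_t


# Hypothetical frequencies of allele G in two populations
print(fst_from_frequencies([0.25, 0.75]))  # 0.25
```

With identical frequencies in every population the function returns 0, and with populations fixed for different alleles it returns 1, which is a quick sanity check before applying it to real sites.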
Sorry, I guess I should have given an example of my data sooner, but I have now posted how my data is organized. On another note, could you please let me know where I can find more information about the bias NGS introduces toward homozygous variants? Thank you for your help.