I am working on 1000 Genomes data. For every population, I'd like to find the SNPs that occur only in that population (population-private SNPs). My plan is to repeatedly take the difference between the SNP sets of different populations: for YRI and LWK, say, I'd take all the SNPs in YRI and filter out those shared between YRI and LWK, then repeat the exercise for the other populations. I suspect this kind of functionality has already been implemented in one of the VCF analysis tools or genome analysis packages. If you know of a command or pipeline that does this, please let me know. I could code up the solution myself, but it would save me a great deal of time if I could avoid duplicating existing work.
Don't know about existing tools. But we could get the allele frequency per population, then use set operations to get the SNP lists?
^^ It does indeed seem to be as straightforward as zx8754 describes. The allele frequency data can be used to infer which alleles are present in only one population group. If I were actively working on this, I would spend some time getting the 1000 Genomes data into a single BCF and also a PLINK dataset, where it would then be easier to work with.
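To make the set-operation idea concrete, here is a minimal Python sketch. It assumes the per-population alternate allele frequencies have already been loaded into a dictionary; the function name `private_snps`, the `min_freq` threshold, and the toy SNP IDs and frequencies are all made up for illustration and are not from any tool.

```python
from typing import Dict, Set


def private_snps(
    freqs: Dict[str, Dict[str, float]], min_freq: float = 0.0
) -> Dict[str, Set[str]]:
    """Return, for each population, the SNPs whose alternate allele is
    observed (frequency > min_freq) in that population and in no other.

    freqs maps population name -> {snp_id: alternate allele frequency}.
    A SNP missing from a population's table is treated as absent there.
    """
    # SNPs at which the alternate allele is seen in each population
    present = {
        pop: {snp for snp, f in table.items() if f > min_freq}
        for pop, table in freqs.items()
    }
    private = {}
    for pop, snps in present.items():
        # Union of the SNPs seen in every other population
        others = set().union(*(s for p, s in present.items() if p != pop))
        private[pop] = snps - others
    return private


# Toy input (hypothetical IDs and frequencies, for illustration only)
toy_freqs = {
    "YRI": {"snp_a": 0.10, "snp_b": 0.25, "snp_c": 0.0},
    "LWK": {"snp_b": 0.05, "snp_c": 0.02},
    "CEU": {"snp_c": 0.30, "snp_d": 0.40},
}
result = private_snps(toy_freqs)
```

Taking the union of all other populations once per population avoids doing the pairwise differences one at a time. Note that this only looks at the alternate allele, so a site where the reference allele is the private one would need the frequencies flipped first.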
Thanks, I have one more question. I split the bed files by sub-population by running
plink --bfile <MyFile.bed> --keep </path/to/sample/ids>
How do I test whether the allele frequencies differ across populations? I am considering 7 sub-populations of the 1000 Genomes data set for my analysis. I believe I'll need to build a phenotype file for this, but I am not clear on how to build the file and run it in plink. I would appreciate the format of the file and, if possible, the plink commands.
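One possible starting point, offered as a sketch rather than a definitive answer: as far as I know, PLINK 1.9 accepts a cluster file through `--within` with three whitespace-separated columns (family ID, individual ID, cluster name), and `--freq` combined with `--within` writes per-cluster allele frequencies. The Python below builds the lines of such a file from (sample, population) pairs. The helper name `cluster_lines`, the sample IDs, and the assumption that the family ID equals the individual ID are all hypothetical, so check them against your own .fam file.

```python
from typing import Iterable, List, Tuple


def cluster_lines(
    samples: Iterable[Tuple[str, str]], keep_pops: Iterable[str]
) -> List[str]:
    """Build 'FID IID CLUSTER' lines for the populations of interest.

    samples is an iterable of (sample_id, population) pairs.
    ASSUMPTION: the family ID is the same as the individual ID;
    adjust this if the .fam file uses a different convention.
    """
    keep = set(keep_pops)
    return [f"{sid} {sid} {pop}" for sid, pop in samples if pop in keep]


# Placeholder sample IDs, for illustration only
example = cluster_lines(
    [("S1", "YRI"), ("S2", "LWK"), ("S3", "CEU")],
    keep_pops=["YRI", "LWK"],
)
# To save the file: open("clusters.txt", "w").write("\n".join(example) + "\n")
```

With the file written, something like `plink --bfile <prefix> --within clusters.txt --freq` should produce a frequency report stratified by population without splitting the dataset first. That gives the per-population frequencies to compare; whether a formal test of frequency differences is needed on top of that is a separate question.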